[jira] [Commented] (MESOS-5330) Agent should backoff before connecting to the master
[ https://issues.apache.org/jira/browse/MESOS-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277594#comment-15277594 ] David Robinson commented on MESOS-5330: --- https://reviews.apache.org/r/47154/ > Agent should backoff before connecting to the master > > > Key: MESOS-5330 > URL: https://issues.apache.org/jira/browse/MESOS-5330 > Project: Mesos > Issue Type: Bug > Reporter: David Robinson > Assignee: David Robinson > > When an agent is started it starts a background task (a libprocess process?) to detect the leading master. When the leading master is detected (or changes), the [SocketManager's link() method is called and a TCP connection to the master is established|https://github.com/apache/mesos/blob/a138e2246a30c4b5c9bc3f7069ad12204dcaffbc/src/slave/slave.cpp#L954]. The agent _then_ backs off before sending a ReRegisterSlave message via the newly established connection. The agent needs to back off _before_ attempting to establish a TCP connection to the master, not before sending the first message over the connection. > During scale tests at Twitter we discovered that agents can SYN flood the master upon leader changes; the problem described in MESOS-5200, where ephemeral connections are used, can then occur, which exacerbates the problem. The end result is a lot of hosts setting up and tearing down TCP connections every slave_ping_timeout seconds (15 by default), connections failing to be established, and hosts being marked as unhealthy and shut down. We observed ~800 passive TCP connections per second on the leading master during scale tests. > The problem can be somewhat mitigated by tuning the kernel to handle a thundering herd of TCP connections, but ideally there would not be a thundering herd to begin with. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
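A jittered exponential backoff applied before the connect step, as the ticket requests, might look like the following sketch. This is illustrative C++, not the Mesos implementation; the function name and constants are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <random>

// Illustrative sketch (not the actual Mesos code): compute a jittered
// exponential backoff that an agent would wait *before* opening the TCP
// connection to a newly detected master, rather than before sending the
// first message over an already-established connection.
double backoffSeconds(int attempt, double base = 0.5, double cap = 60.0)
{
  // Exponential growth, capped at `cap` seconds.
  double interval = std::min(cap, base * (1 << std::min(attempt, 20)));

  // Full jitter: pick uniformly in [0, interval] so a fleet of agents
  // detecting the same leader change does not reconnect in lockstep
  // (avoiding the SYN-flood thundering herd described above).
  static std::mt19937 gen(std::random_device{}());
  std::uniform_real_distribution<double> dist(0.0, interval);
  return dist(gen);
}
```

The agent would sleep for `backoffSeconds(attempt)` before each connection attempt, incrementing `attempt` on every failure.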
[jira] [Issue Comment Deleted] (MESOS-5330) Agent should backoff before connecting to the master
[ https://issues.apache.org/jira/browse/MESOS-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Robinson updated MESOS-5330: -- Comment: was deleted (was: https://reviews.apache.org/r/47080/)
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277516#comment-15277516 ] Joseph Wu commented on MESOS-5342: -- We only use GitHub PRs for website/UI-related changes. Everything else needs to go through ReviewBoard. > CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization > Affects Versions: 0.28.1 > Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance, specifically in terms of cache locality. Applications requiring GPU resources can benefit from this feature by getting access to the cores closest to the GPU hardware, which reduces CPU-GPU copy latency. > Most cluster management systems from the HPC community (e.g. SLURM) provide both cgroup isolation and CPU binding. This feature would provide similar capabilities. The current interest in supporting Intel's Cache Allocation Technology, and the advent of Intel's Knights-series processors, will require making choices about where containers are going to run on the mesos-agent's processor cores - this feature is a step toward developing a robust solution. > The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. > For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency.
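The core-selection idea sketched in the ticket (a per-core utilization histogram feeding an assignment choice) could, in its simplest form, look like the following. This is hypothetical C++; the proposed patch may use a very different optimization technique, and the names here are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of core selection for container pinning: track
// per-core utilization in a histogram and assign a new container to the
// `count` least-utilized cores. A latency-aware version would additionally
// weight cores by their distance to, e.g., the GPU's NUMA node.
std::vector<size_t> pickCores(const std::vector<double>& utilization, size_t count)
{
  std::vector<size_t> cores(utilization.size());
  for (size_t i = 0; i < cores.size(); ++i) cores[i] = i;

  // Sort core indices by ascending utilization; ties keep the lower index.
  std::stable_sort(cores.begin(), cores.end(),
      [&](size_t a, size_t b) { return utilization[a] < utilization[b]; });

  cores.resize(std::min(count, cores.size()));
  return cores;
}
```

The selected indices would then be written to the container's `cpuset.cpus` cgroup control to enforce the pinning.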
[jira] [Updated] (MESOS-5354) Update "driver" as optional for DockerVolume.
[ https://issues.apache.org/jira/browse/MESOS-5354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-5354: --- Description: After some testing with the Docker API, I found that when using "docker run" to create a container, the volume name is required but the volume driver is optional. When using "dvdcli", both name and driver are required. We currently define "driver" as required; we should update "driver" to optional so that the DockerContainerizer still works even if the user did not specify a driver when creating a container with a volume.
{code}
message DockerVolume {
  // Driver of the volume, it can be flocker, convoy, rexray etc.
  required string driver = 1;
  // Name of the volume.
  required string name = 2;
  // Volume driver specific options.
  optional Parameters driver_options = 3;
}
{code}
[jira] [Created] (MESOS-5354) Update "driver" as optional for DockerVolume.
Guangya Liu created MESOS-5354: -- Summary: Update "driver" as optional for DockerVolume. Key: MESOS-5354 URL: https://issues.apache.org/jira/browse/MESOS-5354 Project: Mesos Issue Type: Bug Reporter: Guangya Liu Assignee: Guangya Liu After some testing with the Docker API, I found that when using "docker run" to create a container, the volume name is required but the volume driver is optional. When using "dvdcli", both name and driver are required. We currently define "driver" as required; we should update "driver" to optional so that the DockerContainerizer still works even if the user did not specify a driver when creating a container with a volume.
{code}
message DockerVolume {
  // Driver of the volume, it can be flocker, convoy, rexray etc.
  required string driver = 1; <-- Shall we update this to optional?
  // Name of the volume.
  required string name = 2;
  // Volume driver specific options.
  optional Parameters driver_options = 3;
}
{code}
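For illustration, once `driver` becomes optional, consumer code has to handle its absence. This sketch mimics the generated protobuf accessors with `std::optional`; the struct, the helper name, and the "local" fallback (Docker's default volume driver) are assumptions for the example, not part of the proposal.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Illustrative stand-in for the generated DockerVolume protobuf class,
// with `driver` modeled as optional (the change this ticket proposes).
struct DockerVolume {
  std::optional<std::string> driver;  // was `required` in the proto
  std::string name;                   // still required
};

// When no driver is given, fall back to Docker's default "local" driver,
// matching `docker run`'s behavior of requiring only the volume name.
std::string resolveDriver(const DockerVolume& volume)
{
  return volume.driver.value_or("local");
}
```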
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277463#comment-15277463 ] Chris commented on MESOS-5342: -- [~kaysoky] Sure thing - I've done some of this work prior to writing the code in a local README. Shouldn't be too much trouble transposing that information onto googledocs. Oh, should the source be posted on github under a separate branch for review?
[jira] [Created] (MESOS-5353) Use `Connection` abstraction to compare stale connections in scheduler library.
Anand Mazumdar created MESOS-5353: - Summary: Use `Connection` abstraction to compare stale connections in scheduler library. Key: MESOS-5353 URL: https://issues.apache.org/jira/browse/MESOS-5353 Project: Mesos Issue Type: Improvement Reporter: Anand Mazumdar Priority: Minor Previously, we had a bug in the {{Connection}} abstraction in libprocess that hindered the ability to pass it onto {{defer}} callbacks since it could sometimes lead to deadlock (MESOS-4658). Now that it is resolved, we might consider not using {{UUID}} objects for stale connection checks but directly using the {{Connection}} abstraction in the scheduler library.
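The idea can be illustrated with a toy `Connection` whose identity is its shared underlying state. This is hypothetical C++; libprocess's real `Connection` type differs, but the comparison-by-identity principle is the same.

```cpp
#include <cassert>
#include <memory>

// Illustrative sketch (not libprocess's actual API): instead of tagging
// each connection with a UUID and comparing UUIDs to detect staleness,
// compare the connection handles themselves. Identity here is the shared
// control block, so copies of the same connection compare equal.
struct Connection {
  std::shared_ptr<int> data = std::make_shared<int>(0);

  bool operator==(const Connection& that) const { return data == that.data; }
};

// A response handler can drop messages arriving on a connection that is
// no longer the current one, without any UUID bookkeeping.
bool isStale(const Connection& incoming, const Connection& current)
{
  return !(incoming == current);
}
```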
[jira] [Commented] (MESOS-1575) master sets failover timeout to 0 when framework requests a high value
[ https://issues.apache.org/jira/browse/MESOS-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277406#comment-15277406 ] José Guilherme Vanz commented on MESOS-1575: [~neilc], is there something more to change in the patch? > master sets failover timeout to 0 when framework requests a high value > -- > > Key: MESOS-1575 > URL: https://issues.apache.org/jira/browse/MESOS-1575 > Project: Mesos > Issue Type: Bug > Reporter: Kevin Sweeney > Assignee: José Guilherme Vanz > Labels: newbie, twitter > > In response to a registered RPC we observed the following behavior:
> {noformat}
> W0709 19:07:32.982997 11400 master.cpp:612] Using the default value for 'failover_timeout' because the input value is invalid: Argument out of the range that a Duration can represent due to int64_t's size limit
> I0709 19:07:32.983008 11404 hierarchical_allocator_process.hpp:408] Deactivated framework 20140709-184342-119646400-5050-11380-0003
> I0709 19:07:32.983013 11400 master.cpp:617] Giving framework 20140709-184342-119646400-5050-11380-0003 0ns to failover
> I0709 19:07:32.983271 11404 master.cpp:2201] Framework failover timeout, removing framework 20140709-184342-119646400-5050-11380-0003
> I0709 19:07:32.983294 11404 master.cpp:2688] Removing framework 20140709-184342-119646400-5050-11380-0003
> I0709 19:07:32.983678 11404 hierarchical_allocator_process.hpp:363] Removed framework 20140709-184342-119646400-5050-11380-0003
> {noformat}
> This was using the following frameworkInfo.
> {code}
> FrameworkInfo frameworkInfo = FrameworkInfo.newBuilder()
>     .setUser("test")
>     .setName("jvm")
>     .setFailoverTimeout(Long.MAX_VALUE)
>     .build();
> {code}
> Instead of silently defaulting large values to 0, the master should refuse to process the request.
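The underlying failure is an int64_t overflow when converting a huge seconds value into Duration's nanosecond representation, at which point the master silently falls back to 0 and immediately removes the framework. A sketch of the rejection-based validation the ticket asks for (illustrative, not the actual master code; the function name is an assumption):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Duration stores nanoseconds in an int64_t, so a failover_timeout given in
// seconds must satisfy seconds * 1e9 <= INT64_MAX or the conversion
// overflows. Rather than silently defaulting to 0 (which tears the
// framework down immediately, as in the log above), reject the request.
bool validFailoverTimeout(double seconds)
{
  const double maxSeconds =
      static_cast<double>(std::numeric_limits<int64_t>::max()) / 1e9;
  return seconds >= 0.0 && seconds <= maxSeconds;
}
```

On a `false` result the master would respond with an error to the registration attempt instead of proceeding with a zero timeout.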
[jira] [Created] (MESOS-5352) Docker volume isolator cleanup can be blocked by first cleanup failure.
Gilbert Song created MESOS-5352: --- Summary: Docker volume isolator cleanup can be blocked by first cleanup failure. Key: MESOS-5352 URL: https://issues.apache.org/jira/browse/MESOS-5352 Project: Mesos Issue Type: Bug Components: containerization Reporter: Gilbert Song The summary title may be confusing; please see the description below for details. Some background: 1) In docker volume isolator cleanup, we currently do reference counting for docker volumes. The volume driver's `unmount` will only be called if the ref count is 1. 2) We have a hash map `infos` to track the docker volume mount information for each containerId, and a containerId is erased from the hash map only if all driver `unmount` calls succeed (each subprocess returns a ready future). The issue in this JIRA is that if we have a slave running (not shut down or rebooted in this case) and keep launching frameworks which make use of docker volumes, then once any docker volume isolator cleanup returns a failure, all other `unmount` calls for those volumes will be blocked by the reference count: `_cleanup()` returns a failure, so the containerId in the hash map `infos` is not erased even though all volumes may have been unmounted/detached correctly. (The docker volume isolator calls the driver's unmount as a subprocess, and a failure message may be returned by the driver even if all volumes were unmounted/detached correctly.) The stale containerId in `infos` then makes every subsequent isolator cleanup count one extra volume reference, which means it refuses to call the driver's unmount. So after all tasks finish, all those docker volumes from the first failure will still be in the `attached` status. This issue goes away after the slave recovers, but we cannot rely on restarting the slave every time we hit this case.
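The reference-counting trap can be modeled in a few lines. This is hypothetical C++: the names `infos` and `cleanup` mirror the description, not the real isolator code, and erasing the containerId even when the driver reports a (possibly spurious) failure is one possible remedy, not the committed fix.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model of the docker volume isolator's bookkeeping: `infos` maps a
// containerId to the volume it mounted. The driver's `unmount` is only
// invoked when this container holds the last reference to the volume.
// If a failed cleanup left its containerId in `infos`, every later cleanup
// of the same volume would over-count references by one and never unmount.
struct Isolator {
  std::map<std::string, std::string> infos;  // containerId -> volume name
  int unmounts = 0;

  size_t refs(const std::string& volume) const {
    size_t n = 0;
    for (const auto& [id, v] : infos) n += (v == volume);
    return n;
  }

  // One possible remedy: erase the containerId from `infos` even when the
  // driver reports a failure, so a stale entry cannot inflate later
  // reference counts and block all future unmounts.
  void cleanup(const std::string& containerId, bool driverFails) {
    auto it = infos.find(containerId);
    if (it == infos.end()) return;
    if (refs(it->second) == 1 && !driverFails) ++unmounts;
    infos.erase(it);  // erase regardless of driver failure
  }
};
```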
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277245#comment-15277245 ] Joseph Wu commented on MESOS-5342: -- You can post a link to the document as a JIRA link (we usually use Google Docs, but anything will work).
[jira] [Created] (MESOS-5351) DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithVolumes is flaky
Joseph Wu created MESOS-5351: Summary: DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithVolumes is flaky Key: MESOS-5351 URL: https://issues.apache.org/jira/browse/MESOS-5351 Project: Mesos Issue Type: Bug Components: test Environment: GCC 4.9 CentOS 7 and Fedora 23 (Both SSL or no-SSL) Reporter: Joseph Wu Consistently fails on Mesosphere internal CI: {code} [14:38:12] : [Step 10/10] [ RUN ] DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithVolumes [14:38:12]W: [Step 10/10] I0509 14:38:12.782032 2386 cluster.cpp:149] Creating default 'local' authorizer [14:38:12]W: [Step 10/10] I0509 14:38:12.786592 2386 leveldb.cpp:174] Opened db in 4.462265ms [14:38:12]W: [Step 10/10] I0509 14:38:12.787979 2386 leveldb.cpp:181] Compacted db in 1.368995ms [14:38:12]W: [Step 10/10] I0509 14:38:12.788007 2386 leveldb.cpp:196] Created db iterator in 4994ns [14:38:12]W: [Step 10/10] I0509 14:38:12.788014 2386 leveldb.cpp:202] Seeked to beginning of db in 724ns [14:38:12]W: [Step 10/10] I0509 14:38:12.788019 2386 leveldb.cpp:271] Iterated through 0 keys in the db in 388ns [14:38:12]W: [Step 10/10] I0509 14:38:12.788031 2386 replica.cpp:779] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned [14:38:12]W: [Step 10/10] I0509 14:38:12.788249 2402 recover.cpp:447] Starting replica recovery [14:38:12]W: [Step 10/10] I0509 14:38:12.788316 2402 recover.cpp:473] Replica is in EMPTY status [14:38:12]W: [Step 10/10] I0509 14:38:12.788684 2406 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (18057)@172.30.2.145:48816 [14:38:12]W: [Step 10/10] I0509 14:38:12.788744 2405 recover.cpp:193] Received a recover response from a replica in EMPTY status [14:38:12]W: [Step 10/10] I0509 14:38:12.788869 2400 recover.cpp:564] Updating replica status to STARTING [14:38:12]W: [Step 10/10] I0509 14:38:12.789206 2406 master.cpp:383] Master 6c04237d-91d6-4a05-849a-8b46fdeafe76 (ip-172-30-2-145.mesosphere.io) started on 
172.30.2.145:48816 [14:38:12]W: [Step 10/10] I0509 14:38:12.789216 2406 master.cpp:385] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="true" --authenticate_http="true" --authenticate_http_frameworks="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/vepf2X/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/vepf2X/master" --zk_session_timeout="10secs" [14:38:12]W: [Step 10/10] I0509 14:38:12.789342 2406 master.cpp:434] Master only allowing authenticated frameworks to register [14:38:12]W: [Step 10/10] I0509 14:38:12.789348 2406 master.cpp:440] Master only allowing authenticated agents to register [14:38:12]W: [Step 10/10] I0509 14:38:12.789351 2406 master.cpp:446] Master only allowing authenticated HTTP frameworks to register [14:38:12]W: [Step 10/10] I0509 14:38:12.789355 2406 credentials.hpp:37] Loading credentials for authentication from '/tmp/vepf2X/credentials' [14:38:12]W: [Step 10/10] I0509 14:38:12.789466 2406 master.cpp:490] Using default 'crammd5' authenticator [14:38:12]W: [Step 10/10] I0509 14:38:12.789504 2406 master.cpp:561] Using default 'basic' HTTP authenticator [14:38:12]W: [Step 10/10] I0509 14:38:12.789540 2406 master.cpp:641] Using default 'basic' HTTP framework authenticator 
[14:38:12]W: [Step 10/10] I0509 14:38:12.789599 2406 master.cpp:688] Authorization enabled [14:38:12]W: [Step 10/10] I0509 14:38:12.789669 2402 hierarchical.cpp:142] Initialized hierarchical allocator process [14:38:12]W: [Step 10/10] I0509 14:38:12.789691 2407 whitelist_watcher.cpp:77] No whitelist given [14:38:12]W: [Step 10/10] I0509 14:38:12.790190 2403 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.259226ms [14:38:12]W: [Step 10/10] I0509 14:38:12.790207 2403 replica.cpp:320] Persisted replica status to STARTING [14:38:12]W: [Step 10/10] I0509 14:38:12.790297 2406 master.cpp:1939] The newly elected leader is master@172.30.2.145:48816 with id 6c04237d-91d6-4a05-849a-8b46fdeafe76 [14:38:12]W: [Step
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277207#comment-15277207 ] Chris commented on MESOS-5342: -- [~kaysoky] where are design documents supposed to be posted? I've gone through the patch submission documentation and will review the testing documentation and style guides.
[jira] [Created] (MESOS-5350) Add asynchronous hook for validating docker containerizer tasks
Joseph Wu created MESOS-5350: Summary: Add asynchronous hook for validating docker containerizer tasks Key: MESOS-5350 URL: https://issues.apache.org/jira/browse/MESOS-5350 Project: Mesos Issue Type: Improvement Components: docker, modules Reporter: Joseph Wu Assignee: Joseph Wu Priority: Minor It is possible to plug in custom validation logic for the MesosContainerizer via an {{Isolator}} module, but the same is not true of the DockerContainerizer. Basic logic can be plugged into the DockerContainerizer via {{Hooks}}, but this has some notable differences compared to isolators: * Hooks are synchronous. * Modifications to tasks via Hooks have lower priority compared to the task itself, e.g. if both the {{TaskInfo}} and {{slaveExecutorEnvironmentDecorator}} define the same environment variable, the {{TaskInfo}} wins. * Hooks have no effect if they fail (short of segfaulting), e.g. {{slavePreLaunchDockerHook}} has a return type of {{Try}}: https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/include/mesos/hook.hpp#L90 but the only effect of returning an {{Error}} is a log message: https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/src/hook/manager.cpp#L227-L230 We should add a hook to the DockerContainerizer to narrow this gap. This new hook would: * Be called at roughly the same place as {{slavePreLaunchDockerHook}}: https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/src/slave/containerizer/docker.cpp#L1022 * Return a {{Future}} and require splitting up {{DockerContainerizer::launch}}. * Prevent a task from launching if it returns a {{Failure}}.
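A minimal sketch of the proposed asynchronous hook semantics follows. This is illustrative C++ using `std::future` (deferred here for simplicity); the real implementation would use libprocess's Future and the TaskInfo protobuf, and the function names are assumptions.

```cpp
#include <cassert>
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical asynchronous validation hook: returns a future that fails
// (via exception) when the task should not launch. The empty-taskId check
// stands in for arbitrary custom validation logic.
std::future<void> validateTask(const std::string& taskId)
{
  return std::async(std::launch::deferred, [taskId]() {
    if (taskId.empty()) {
      throw std::runtime_error("task validation failed");
    }
    // Custom validation logic would run here.
  });
}

// Launch path: a failed future *prevents* the task from launching, instead
// of merely producing a log message as slavePreLaunchDockerHook does today.
bool launch(const std::string& taskId)
{
  try {
    validateTask(taskId).get();
    return true;   // proceed with the Docker launch
  } catch (const std::exception&) {
    return false;  // validation failure blocks the launch
  }
}
```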
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277056#comment-15277056 ] Jie Yu commented on MESOS-5342: --- +1 Sending a patch to RB without a design (for a non trivial feature) should be avoided.
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276992#comment-15276992 ] Joseph Wu commented on MESOS-5342: -- Ideally (and especially for new contributors), you should find a shepherd _before_ starting work on an issue, which will save you time in the long-run. I would recommend taking some time and reading some of our contribution guides: * http://mesos.apache.org/documentation/latest/c++-style-guide/ * http://mesos.apache.org/documentation/latest/submitting-a-patch/ * http://mesos.apache.org/documentation/latest/testing-patterns/ It would also help to have a design document that describes the goal and some implementation decisions you've made.
[jira] [Updated] (MESOS-5349) A large number of tasks stuck in Staging state.
[ https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-5349: -- Description: We saw a weird issue happening on one of our test clusters over the weekend. A large number of tasks from the example {{long running framework}} were stuck in staging. The executor was duly sending status updates for all the tasks, and the slave successfully received the status updates as seen from the logs, but for some reason never got to checkpointing them. From the agent logs, it seems that it kept on retrying some backlogged status updates starting with the 4xxx/6xxx range while the present tasks were launched in the 8xxx range (task IDs). The issue resolved itself after a few hours upon the agent (re-)registering with the master upon losing its ZK session. Let's take a timeline of a particular task, 8142. Agent logs before restart: {code} May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.204941 2820 slave.cpp:1522] Got assigned task 8142 for framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205142 2820 slave.cpp:1641] Launching task 8142 for framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205656 2820 slave.cpp:1880] Queuing task '8142' for executor 'default' of framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP) May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.206092 2818 disk.cpp:169] Updating the disk resources for container f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.101; mem(*):33 May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207093 2816 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 33MB for container f68f137c-b101-4f9f-8de4-f50eae27e969 May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207293 2821 cpushare.cpp:389] Updated 'cpu.shares' to 103 (cpus 0.101) for container
f68f137c-b101-4f9f-8de4-f50eae27e969 May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208742 2821 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10100us (cpus 0.101) for container f68f137c-b101-4f9f-8de4-f50eae27e969 May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208902 2818 slave.cpp:2032] Sending queued task '8142' to executor 'default' of framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP) May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210290 2821 http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921 May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210357 2821 slave.cpp:3221] Handling status update TASK_RUNNING (UUID: 85323c4f-e523-495e-9b49-39b0a7792303) for task 8142 of framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213770 2817 http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921 May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213882 2817 slave.cpp:3221] Handling status update TASK_FINISHED (UUID: 285b73e1-7f5a-43e5-8385-7b76e0fbdad4) for task 8142 of framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.214787 2821 disk.cpp:169] Updating the disk resources for container f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.1; mem(*):32 May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215365 2823 cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container f68f137c-b101-4f9f-8de4-f50eae27e969 May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215641 2820 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container f68f137c-b101-4f9f-8de4-f50eae27e969 May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.216878 2823 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for 
container f68f137c-b101-4f9f-8de4-f50eae27e969 {code} Agent logs for this task upon restart: {code} May 09 15:22:14 ip-10-10-0-6 mesos-slave[14314]: W0509 15:22:14.083993 14318 state.cpp:606] Failed to find status updates file '/var/lib/mesos/slave/meta/slaves/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-S1/frameworks/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003/executors/default/runs/f68f137c-b101-4f9f-8de4-f50eae27e969/tasks/8142/task.updates' {code} Things that need to be investigated: - Why couldn't the agent get around to handling the status updates from the executor i.e. even checkpointing them? - What made the agent get _so_ backlogged on the status updates i.e. why it kept resending the old status updates for the 4/6 tasks without getting around to the newer tasks. PFA the agent/master logs. This is running against Mesos {{HEAD -> 557cab591f35a6c3d2248d7af7f06cdf99726e92}} was: We
[jira] [Updated] (MESOS-5349) A large number of tasks stuck in Staging state.
[ https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-5349:
--
Attachment: log_agent_before_zk_disconnect.gz
[jira] [Issue Comment Deleted] (MESOS-5349) A large number of tasks stuck in Staging state.
[ https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-5349:
--
Comment: was deleted (was: Trying to upload the detailed master/agent logs somewhere else as they are rather large.)
[jira] [Updated] (MESOS-5349) A large number of tasks stuck in Staging state.
[ https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-5349:
--
Attachment: log_agent_after_zk_disconnect.gz
[jira] [Updated] (MESOS-5349) A large number of tasks stuck in Staging state.
[ https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-5349:
--
Attachment: staging_tasks.png
[jira] [Updated] (MESOS-5349) A large number of tasks stuck in Staging state.
[ https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-5349:
--
Attachment: mesos-slave.WARNING mesos-master.WARNING master-state.log agent-state.log

Trying to upload the detailed master/agent logs somewhere else as they are rather large.
[jira] [Created] (MESOS-5349) A large number of tasks stuck in Staging state.
Anand Mazumdar created MESOS-5349:
-
Summary: A large number of tasks stuck in Staging state. Key: MESOS-5349 URL: https://issues.apache.org/jira/browse/MESOS-5349 Project: Mesos Issue Type: Bug Components: slave Affects Versions: 0.29.0 Reporter: Anand Mazumdar
[jira] [Commented] (MESOS-3220) Offer ability to kill tasks from the API
[ https://issues.apache.org/jira/browse/MESOS-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276822#comment-15276822 ] Michael Gummelt commented on MESOS-3220: +1. I'm implementing this behavior in Spark. It would be more efficient if Mesos offered it, so we wouldn't have to reimplement it at the framework level. > Offer ability to kill tasks from the API > > > Key: MESOS-3220 > URL: https://issues.apache.org/jira/browse/MESOS-3220 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Sunil Shah > Labels: mesosphere > > We are investigating adding a {{dcos task kill}} command to our DCOS (and > Mesos) command line interface. Currently the ability to kill tasks is only > offered via the scheduler API, so it would be useful to have some ability to > kill tasks directly. > This would complement the Maintenance Primitives, in that it would enable the > operator to terminate those tasks which, for whatever reasons, do not respond > to Inverse Offers events. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3243) Replace NULL with nullptr
[ https://issues.apache.org/jira/browse/MESOS-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276758#comment-15276758 ] Tomasz Janiszewski commented on MESOS-3243: --- Ping > Replace NULL with nullptr > - > > Key: MESOS-3243 > URL: https://issues.apache.org/jira/browse/MESOS-3243 > Project: Mesos > Issue Type: Improvement >Reporter: Michael Park >Assignee: Tomasz Janiszewski > > As part of the C++ upgrade, it would be nice to move our use of {{NULL}} over > to use {{nullptr}}. I think it would be an interesting exercise to do this > with {{clang-modernize}} using the [nullptr > transform|http://clang.llvm.org/extra/UseNullptrTransform.html] (although > it's probably just as easy to use {{sed}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
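A sed-based pass like the one the ticket mentions can be sketched in a few lines. This is only an illustration (hypothetical file name, naive pattern): a real sweep would need review, since a bare word-boundary substitution would also rewrite {{NULL}} occurring inside string literals or comments, which the clang-modernize transform avoids.

```shell
# Create a small C++ source file that still uses NULL (illustrative only).
cat > /tmp/sample.cpp <<'EOF'
#include <cstddef>
int* p = NULL;
void f(char* s = NULL);
EOF

# Naive word-boundary replacement of NULL with nullptr (GNU sed).
sed -i 's/\bNULL\b/nullptr/g' /tmp/sample.cpp

# Both declarations now use nullptr.
grep -c 'nullptr' /tmp/sample.cpp   # prints 2
```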
[jira] [Updated] (MESOS-5212) Allow any principal in ReservationInfo when HTTP authentication is off
[ https://issues.apache.org/jira/browse/MESOS-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernd Mathiske updated MESOS-5212: -- Shepherd: Bernd Mathiske > Allow any principal in ReservationInfo when HTTP authentication is off > -- > > Key: MESOS-5212 > URL: https://issues.apache.org/jira/browse/MESOS-5212 > Project: Mesos > Issue Type: Improvement >Affects Versions: 0.28.1 >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere > Fix For: 0.29.0 > > > Mesos currently provides no way for operators to pass their principal to HTTP > endpoints when HTTP authentication is off. Since we enforce that > {{ReservationInfo.principal}} be equal to the operator principal in requests > to {{/reserve}}, this means that when HTTP authentication is disabled, the > {{ReservationInfo.principal}} field cannot be set. > To address this in the short-term, we should allow > {{ReservationInfo.principal}} to hold any value when HTTP authentication is > disabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3926) Modularize URI fetcher plugin interface.
[ https://issues.apache.org/jira/browse/MESOS-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-3926: - Sprint: (was: Mesosphere Sprint 35) > Modularize URI fetcher plugin interface. > -- > > Key: MESOS-3926 > URL: https://issues.apache.org/jira/browse/MESOS-3926 > Project: Mesos > Issue Type: Task > Components: fetcher >Reporter: Jie Yu >Assignee: Shuai Lin > Labels: fetcher, mesosphere, module > > So that we can add custom URI fetcher plugins using modules. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5239) Persistent volume DockerContainerizer support assumes proper mount propagation setup on the host.
[ https://issues.apache.org/jira/browse/MESOS-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-5239: -- Sprint: Mesosphere Sprint 34 Story Points: 3 > Persistent volume DockerContainerizer support assumes proper mount > propagation setup on the host. > - > > Key: MESOS-5239 > URL: https://issues.apache.org/jira/browse/MESOS-5239 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.28.0, 0.28.1 >Reporter: Jie Yu >Assignee: Jie Yu > Labels: mesosphere > Fix For: 0.29.0, 0.28.2 > > > We recently added persistent volume support in DockerContainerizer > (MESOS-3413). To understand the problem, we first need to understand how > persistent volumes are supported in DockerContainerizer. > To support persistent volumes in DockerContainerizer, we bind mount > persistent volumes under a container's sandbox ('container_path' has to be > relative for persistent volumes). When the Docker container is launched, > since we always add a volume (-v) for the sandbox, the persistent volumes > will be bind mounted into the container as well (since Docker does an 'rbind'). > For the above to work, the Docker daemon must see, in the host mount table, > the persistent volume mounts that Mesos creates. > It's not a problem if the Docker daemon itself is using the host mount namespace. > However, on systemd-enabled systems, the Docker daemon is running in a separate > mount namespace and all mounts in that mount namespace will be marked as > slave mounts due to this > [patch|https://github.com/docker/docker/commit/eb76cb2301fc883941bc4ca2d9ebc3a486ab8e0a]. > What that means is: for this to work, the parent mount of > the agent's work_dir must be a shared mount when the Docker daemon starts. This is > typically true on CentOS 7 and CoreOS, as all mounts are shared mounts by default. > However, this causes an issue with the 'filesystem/linux' isolator. 
To > understand why, first I need to show you a typical problem when dealing with > shared mounts. Let me explain that using the following commands on a CentOS7 > machine: > {noformat} > [root@core-dev run]# cat /proc/self/mountinfo > 24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755 > [root@core-dev run]# mkdir /run/netns > [root@core-dev run]# mount --bind /run/netns /run/netns > [root@core-dev run]# cat /proc/self/mountinfo > 24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755 > 121 24 0:19 /netns /run/netns rw,nosuid,nodev shared:22 - tmpfs tmpfs > rw,seclabel,mode=755 > [root@core-dev run]# ip netns add test > [root@core-dev run]# cat /proc/self/mountinfo > 24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755 > 121 24 0:19 /netns /run/netns rw,nosuid,nodev shared:22 - tmpfs tmpfs > rw,seclabel,mode=755 > 162 121 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc > proc rw > 163 24 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc > proc rw > {noformat} > As you can see above, there are two entries (/run/netns/test) in the mount > table, which is unexpected and can confuse some systems. The reason is > that when we create a self bind mount (/run/netns -> /run/netns), the > mount will be put into the same shared mount peer group (shared:22) as its > parent (/run). Then, when you create another mount underneath that > (/run/netns/test), that mount operation will be propagated to all mounts in > the same peer group (shared:22), resulting in an unexpected additional mount > being created. > The reason we need to do a self bind mount in Mesos is that sometimes, we > need to make sure some mounts are shared so that they do not get copied when > a new mount namespace is created. However, on some systems, mounts are > private by default (e.g., Ubuntu 14.04). 
In those cases, since we cannot > change the system mounts, we have to do a self bind mount so that we can set > mount propagation to shared. For instance, in the filesystem/linux isolator, we do > a self bind mount on the agent's work_dir. > To avoid the self bind mount pitfall mentioned above, in the filesystem/linux > isolator, after we created the mount, we do a make-slave + make-shared so > that the mount is in its own shared mount peer group. In that way, any mounts > underneath it will not be propagated back. > However, that operation will break the assumption that the persistent volume > DockerContainerizer support makes. As a result, we're seeing problems with > persistent volumes in DockerContainerizer when the filesystem/linux isolator is > turned on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
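The isolator's workaround described above reduces to three mount operations on the agent's work_dir. A sketch only, shown with {{printf}} because actually performing the operations requires root (the work_dir path is hypothetical):

```shell
WORK_DIR=/var/lib/mesos   # hypothetical agent work_dir

# 1. Self bind mount, so propagation flags can be set on this path at all.
SELF_BIND="mount --bind ${WORK_DIR} ${WORK_DIR}"
# 2. make-slave detaches it from the parent's shared peer group...
MAKE_SLAVE="mount --make-slave ${WORK_DIR}"
# 3. ...and make-shared starts a fresh peer group of its own, so mounts
#    created underneath no longer propagate back to the parent's group.
MAKE_SHARED="mount --make-shared ${WORK_DIR}"

printf '%s\n' "$SELF_BIND" "$MAKE_SLAVE" "$MAKE_SHARED"
```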
[jira] [Updated] (MESOS-5307) Sandbox mounts should not be in the host mount namespace.
[ https://issues.apache.org/jira/browse/MESOS-5307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-5307: -- Labels: mesosphere (was: ) > Sandbox mounts should not be in the host mount namespace. > - > > Key: MESOS-5307 > URL: https://issues.apache.org/jira/browse/MESOS-5307 > Project: Mesos > Issue Type: Improvement >Reporter: Jie Yu >Assignee: Jie Yu > Labels: mesosphere > Fix For: 0.29.0, 0.28.2 > > > Currently, if a container uses a container image, we'll do a bind mount of its > sandbox ( -> /mnt/mesos/sandbox) in the host mount namespace. > However, doing the mounts in the host mount table is not ideal. That > complicates both the cleanup path and the recovery path. > Instead, we can do the sandbox bind mount in the container's mount namespace > so that cleanup and recovery will be greatly simplified. We can set up mount > propagation properly so that persistent volumes mounted at /xxx can > be propagated into the container. > Here is a simple proof of concept: > Console 1: > {noformat} > vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ ll . 
> total 12 > drwxrwxr-x 3 vagrant vagrant 4096 Apr 25 16:05 ./ > drwxrwxr-x 6 vagrant vagrant 4096 Apr 25 23:17 ../ > drwxrwxr-x 5 vagrant vagrant 4096 Apr 25 23:17 slave/ > vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ ll slave/ > total 20 > drwxrwxr-x 5 vagrant vagrant 4096 Apr 25 23:17 ./ > drwxrwxr-x 3 vagrant vagrant 4096 Apr 25 16:05 ../ > drwxrwxr-x 6 vagrant vagrant 4096 Apr 26 21:06 directory/ > drwxr-xr-x 12 vagrant vagrant 4096 Apr 25 23:20 rootfs/ > drwxrwxr-x 2 vagrant vagrant 4096 Apr 25 16:09 volume/ > vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ sudo mount --bind slave/ slave/ > > > > vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ sudo mount --make-shared slave/ > vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ cat /proc/self/mountinfo > 50 22 8:1 /home/vagrant/tmp/mesos/slave /home/vagrant/tmp/mesos/slave > rw,relatime shared:1 - ext4 > /dev/disk/by-uuid/baf292e5-0bb6-4e58-8a71-5b912e0f09b6 rw,data=ordered > {noformat} > Console 2: > {noformat} > vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ cd slave/ > vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave$ sudo unshare -m /bin/bash > root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# sudo mount --make-rslave . 
> root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# cat /proc/self/mountinfo > 124 63 8:1 /home/vagrant/tmp/mesos/slave /home/vagrant/tmp/mesos/slave > rw,relatime master:1 - ext4 > /dev/disk/by-uuid/baf292e5-0bb6-4e58-8a71-5b912e0f09b6 rw,data=ordered > root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# mount --rbind directory/ > rootfs/mnt/mesos/sandbox/ > > > root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# mount --rbind rootfs/ rootfs/ > root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# mount -t proc proc > rootfs/proc > > > root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# pivot_root rootfs > rootfs/tmp/.rootfs > > > root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# cd / > root@vagrant-ubuntu-trusty-64:/# cat /proc/self/mountinfo > 126 61 8:1 /home/vagrant/tmp/mesos/slave/rootfs / rw,relatime master:1 - ext4 > /dev/disk/by-uuid/baf292e5-0bb6-4e58-8a71-5b912e0f09b6 rw,data=ordered > 127 126 8:1 /home/vagrant/tmp/mesos/slave/directory /mnt/mesos/sandbox > rw,relatime master:1 - ext4 > /dev/disk/by-uuid/baf292e5-0bb6-4e58-8a71-5b912e0f09b6 rw,data=ordered > 128 126 0:3 / /proc rw,relatime - proc proc rw > {noformat} > Console 1: > {noformat} > agrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ cd slave/ > vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave$ sudo mount --bind volume/ > directory/v1 > vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave$ cat /proc/self/mountinfo > 50 22 8:1 /home/vagrant/tmp/mesos/slave /home/vagrant/tmp/mesos/slave > rw,relatime shared:1 - ext4 > /dev/disk/by-uuid/baf292e5-0bb6-4e58-8a71-5b912e0f09b6 rw,data=ordered > 129 50 8:1 /home/vagrant/tmp/mesos/slave/volume > /home/vagrant/tmp/mesos/slave/directory/v1 rw,relatime
[jira] [Updated] (MESOS-3926) Modularize URI fetcher plugin interface.
[ https://issues.apache.org/jira/browse/MESOS-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-3926: - Sprint: Mesosphere Sprint 35 > Modularize URI fetcher plugin interface. > -- > > Key: MESOS-3926 > URL: https://issues.apache.org/jira/browse/MESOS-3926 > Project: Mesos > Issue Type: Task > Components: fetcher >Reporter: Jie Yu >Assignee: Shuai Lin > Labels: fetcher, mesosphere, module > > So that we can add custom URI fetcher plugins using modules. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3926) Modularize URI fetcher plugin interface.
[ https://issues.apache.org/jira/browse/MESOS-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-3926: - Sprint: (was: Mesosphere Sprint 34) > Modularize URI fetcher plugin interface. > -- > > Key: MESOS-3926 > URL: https://issues.apache.org/jira/browse/MESOS-3926 > Project: Mesos > Issue Type: Task > Components: fetcher >Reporter: Jie Yu >Assignee: Shuai Lin > Labels: fetcher, mesosphere, module > > So that we can add custom URI fetcher plugins using modules. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5277) Need to add REMOVE semantics to the copy backend
[ https://issues.apache.org/jira/browse/MESOS-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilbert Song updated MESOS-5277: Sprint: Mesosphere Sprint 35 > Need to add REMOVE semantics to the copy backend > > > Key: MESOS-5277 > URL: https://issues.apache.org/jira/browse/MESOS-5277 > Project: Mesos > Issue Type: Bug > Components: containerization > Environment: linux >Reporter: Avinash Sridharan >Assignee: Gilbert Song > Labels: mesosphere > Fix For: 0.29.0 > > > Some Dockerfiles run the `rm` command to remove files from the base image > using the "RUN" directive in the Dockerfile. An example can be found here: > https://github.com/ngineered/nginx-php-fpm.git > In the final rootfs the removed files should not be present. Presence of > these files in the final image can make the container misbehave. For example, > the nginx-php-fpm docker image that is referenced tries to remove the default > nginx config and replace it with its own config to point at a different HTML > root. If the default nginx config is still present after building the > image, nginx will start pointing to a different HTML root than the one set in > the Dockerfile. > Currently the copy backend cannot handle removal of files from intermediate > layers. This can cause issues with docker images built using a Dockerfile > similar to the one listed here. Hence, we need to add REMOVE semantics to the > copy backend. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
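The failure mode can be reproduced outside Mesos with plain `cp`. In the Docker image format, a `RUN rm ...` is recorded in the upper layer as a whiteout entry (`.wh.<name>`) rather than as an actual deletion, and a backend that only copies layers bottom-up never acts on it. The paths below are hypothetical; this sketches the bug, not Mesos code:

```shell
# Lower layer ships a default config; upper layer "deleted" it,
# which the image format records as a whiteout entry.
mkdir -p /tmp/layers/lower/etc /tmp/layers/upper/etc /tmp/layers/rootfs
echo 'server default;' > /tmp/layers/lower/etc/default.conf
touch /tmp/layers/upper/etc/.wh.default.conf

# A pure copy backend just copies each layer onto the rootfs in order...
cp -a /tmp/layers/lower/. /tmp/layers/rootfs/
cp -a /tmp/layers/upper/. /tmp/layers/rootfs/

# ...so the "removed" file survives. REMOVE semantics would instead
# notice the whiteout and unlink default.conf from the rootfs.
ls -A /tmp/layers/rootfs/etc
```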
[jira] [Updated] (MESOS-4771) Document the network/cni isolator.
[ https://issues.apache.org/jira/browse/MESOS-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-4771: - Sprint: Mesosphere Sprint 35 > Document the network/cni isolator. > -- > > Key: MESOS-4771 > URL: https://issues.apache.org/jira/browse/MESOS-4771 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Avinash Sridharan > > We need to document this isolator in mesos-containerizer.md (e.g., how to > configure it, what's the pre-requisite, etc.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4823) Implement port forwarding in `network/cni` isolator
[ https://issues.apache.org/jira/browse/MESOS-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-4823: - Sprint: Mesosphere Sprint 30, Mesosphere Sprint 31, Mesosphere Sprint 35 (was: Mesosphere Sprint 30, Mesosphere Sprint 31) > Implement port forwarding in `network/cni` isolator > --- > > Key: MESOS-4823 > URL: https://issues.apache.org/jira/browse/MESOS-4823 > Project: Mesos > Issue Type: Task > Components: containerization > Environment: linux >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan >Priority: Critical > Labels: mesosphere > > Most docker and appc images wish to expose ports that micro-services are > listening on to the outside world. When containers are running on bridged > (or ptp) networking this can be achieved by installing port forwarding rules > on the agent (using iptables). This can be done in the `network/cni` > isolator. > The reason we would like this functionality to be implemented in the > `network/cni` isolator, and not a CNI plugin, is that the CNI specification > currently does not support specifying port forwarding rules. Further, to > install these rules the isolator needs two pieces of information: the exposed > ports and the IP address associated with the container. Both are available > to the isolator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
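For bridged networking, the forwarding rule in question is an ordinary DNAT rule. A sketch with hypothetical values (the container IP would come from the CNI plugin result, the exposed port from the image metadata); the rule is printed rather than executed, since installing it requires root:

```shell
# Hypothetical inputs the isolator would already have on hand.
CONTAINER_IP=172.17.0.5
HOST_PORT=31000
CONTAINER_PORT=80

# The iptables rule the isolator would install on the agent.
RULE="iptables -t nat -A PREROUTING -p tcp --dport ${HOST_PORT} -j DNAT --to-destination ${CONTAINER_IP}:${CONTAINER_PORT}"
echo "$RULE"
```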
[jira] [Updated] (MESOS-5066) Create an iptables interface in Mesos
[ https://issues.apache.org/jira/browse/MESOS-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-5066: - Sprint: Mesosphere Sprint 35 > Create an iptables interface in Mesos > - > > Key: MESOS-5066 > URL: https://issues.apache.org/jira/browse/MESOS-5066 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan > Labels: mesosphere > > To support port mapping functionality in the network CNI isolator, we need to > enable DNAT rules in iptables. We therefore need to create an iptables > interface in Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4626) Support Nvidia GPUs with filesystem isolation enabled.
[ https://issues.apache.org/jira/browse/MESOS-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues reassigned MESOS-4626: -- Assignee: Kevin Klues > Support Nvidia GPUs with filesystem isolation enabled. > -- > > Key: MESOS-4626 > URL: https://issues.apache.org/jira/browse/MESOS-4626 > Project: Mesos > Issue Type: Task > Components: isolation >Reporter: Benjamin Mahler >Assignee: Kevin Klues > > When filesystem isolation is enabled, containers that use Nvidia GPU > resources need access to GPU libraries residing on the host. > We'll need to provide a means for operators to inject the necessary volumes > into *all* containers that use "gpus" resources. > See the nvidia-docker project for more details: > [nvidia-docker/tools/src/nvidia/volumes.go|https://github.com/NVIDIA/nvidia-docker/blob/fda10b2d27bf5578cc5337c23877f827e4d1ed77/tools/src/nvidia/volumes.go#L50-L103] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-5257) Add autodiscovery for GPU resources
[ https://issues.apache.org/jira/browse/MESOS-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues reassigned MESOS-5257: -- Assignee: Kevin Klues > Add autodiscovery for GPU resources > --- > > Key: MESOS-5257 > URL: https://issues.apache.org/jira/browse/MESOS-5257 > Project: Mesos > Issue Type: Task >Reporter: Kevin Klues >Assignee: Kevin Klues > Labels: isolator > > Right now, the only way to enumerate the available GPUs on an agent is to use > the `--nvidia_gpu_devices` flag and explicitly list them out. Instead, we > should leverage NVML to autodiscover the GPUs that are available and only use > this flag as a way to explicitly list out the GPUs you want to make available > in order to restrict access to some of them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5258) Turn the Nvidia GPU isolator into a module
[ https://issues.apache.org/jira/browse/MESOS-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-5258: --- Sprint: Mesosphere Sprint 35 Story Points: 5 (was: 3) > Turn the Nvidia GPU isolator into a module > -- > > Key: MESOS-5258 > URL: https://issues.apache.org/jira/browse/MESOS-5258 > Project: Mesos > Issue Type: Task >Reporter: Kevin Klues > > The Nvidia GPU isolator has an external dependency on `libnvidia-ml.so`. As > it currently stands, this forces *all* binaries that link with `libmesos.so` > to also link with `libnvidia-ml.so` (including master, agents on machines > without GPUs, scheduler, executors, etc.). > By turning the Nvidia GPU isolator into a module, it will be loaded at > runtime only when an agent has explicitly included the Nvidia GPU > isolator in its `--isolation` flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-5258) Turn the Nvidia GPU isolator into a module
[ https://issues.apache.org/jira/browse/MESOS-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues reassigned MESOS-5258: -- Assignee: Kevin Klues > Turn the Nvidia GPU isolator into a module > -- > > Key: MESOS-5258 > URL: https://issues.apache.org/jira/browse/MESOS-5258 > Project: Mesos > Issue Type: Task >Reporter: Kevin Klues >Assignee: Kevin Klues > > The Nvidia GPU isolator has an external dependency on `libnvidia-ml.so`. As > it currently stands, this forces *all* binaries that link with `libmesos.so` > to also link with `libnvidia-ml.so` (including master, agents on machines > without GPUs, scheduler, executors, etc.). > By turning the Nvidia GPU isolator into a module, it will be loaded at > runtime only when an agent has explicitly included the Nvidia GPU > isolator in its `--isolation` flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4626) Support Nvidia GPUs with filesystem isolation enabled.
[ https://issues.apache.org/jira/browse/MESOS-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-4626: --- Sprint: Mesosphere Sprint 35 > Support Nvidia GPUs with filesystem isolation enabled. > -- > > Key: MESOS-4626 > URL: https://issues.apache.org/jira/browse/MESOS-4626 > Project: Mesos > Issue Type: Task > Components: isolation >Reporter: Benjamin Mahler > > When filesystem isolation is enabled, containers that use Nvidia GPU > resources need access to GPU libraries residing on the host. > We'll need to provide a means for operators to inject the necessary volumes > into *all* containers that use "gpus" resources. > See the nvidia-docker project for more details: > [nvidia-docker/tools/src/nvidia/volumes.go|https://github.com/NVIDIA/nvidia-docker/blob/fda10b2d27bf5578cc5337c23877f827e4d1ed77/tools/src/nvidia/volumes.go#L50-L103] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5257) Add autodiscovery for GPU resources
[ https://issues.apache.org/jira/browse/MESOS-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-5257: --- Sprint: Mesosphere Sprint 35 > Add autodiscovery for GPU resources > --- > > Key: MESOS-5257 > URL: https://issues.apache.org/jira/browse/MESOS-5257 > Project: Mesos > Issue Type: Task >Reporter: Kevin Klues > Labels: isolator > > Right now, the only way to enumerate the available GPUs on an agent is to use > the `--nvidia_gpu_devices` flag and explicitly list them out. Instead, we > should leverage NVML to autodiscover the GPUs that are available and only use > this flag as a way to explicitly list out the GPUs you want to make available > in order to restrict access to some of them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5221) Add Documentation for Nvidia GPU support
[ https://issues.apache.org/jira/browse/MESOS-5221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-5221: --- Sprint: Mesosphere Sprint 33, Mesosphere Sprint 35 (was: Mesosphere Sprint 33, Mesosphere Sprint 34) > Add Documentation for Nvidia GPU support > > > Key: MESOS-5221 > URL: https://issues.apache.org/jira/browse/MESOS-5221 > Project: Mesos > Issue Type: Documentation >Reporter: Kevin Klues >Assignee: Kevin Klues >Priority: Minor > > https://reviews.apache.org/r/46220/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5347) Enhance the log message when launching mesos containerizer.
[ https://issues.apache.org/jira/browse/MESOS-5347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilbert Song updated MESOS-5347: Fix Version/s: 0.29.0 > Enhance the log message when launching mesos containerizer. > --- > > Key: MESOS-5347 > URL: https://issues.apache.org/jira/browse/MESOS-5347 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Gilbert Song >Assignee: Guangya Liu > Labels: mesosphere > Fix For: 0.29.0 > > > Log the launch flag which includes the executor command, pre-launch commands > and other information when launching the mesos containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5348) Enhance the log message when launching docker containerizer.
Gilbert Song created MESOS-5348: --- Summary: Enhance the log message when launching docker containerizer. Key: MESOS-5348 URL: https://issues.apache.org/jira/browse/MESOS-5348 Project: Mesos Issue Type: Improvement Components: containerization Reporter: Gilbert Song Assignee: Guangya Liu Fix For: 0.29.0 Log the launch flag which includes the executor command and other information when launching the docker containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276658#comment-15276658 ] Chris commented on MESOS-5342: -- Forgot to mention, a shepherd is needed to support integration of this feature! > CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance, specifically in terms of cache locality. > Applications requiring GPU resources can benefit from this feature by getting > access to cores closest to the GPU hardware, which reduces cpu-gpu copy > latency. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology, and the advent of Intel's Knights-series processors, will require > making choices about where containers are going to run on the mesos-agent's > processor cores - this feature is a step toward developing a robust > solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. > For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5173) Allow master/agent to take multiple modules manifest files
[ https://issues.apache.org/jira/browse/MESOS-5173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kapil Arya updated MESOS-5173: -- Shepherd: Till Toenshoff Summary: Allow master/agent to take multiple modules manifest files (was: Allow master/agent to take multiple --modules flags) > Allow master/agent to take multiple modules manifest files > -- > > Key: MESOS-5173 > URL: https://issues.apache.org/jira/browse/MESOS-5173 > Project: Mesos > Issue Type: Task >Reporter: Kapil Arya >Assignee: Kapil Arya > Labels: mesosphere > Fix For: 0.29.0 > > > When loading multiple modules into master/agent, one has to merge all module > metadata (library name, module name, parameters, etc.) into a single json > file which is then passed on to the --modules flag. This quickly becomes > cumbersome, especially if the modules are coming from different > vendors/developers. > An alternative would be to allow multiple invocations of the --modules flag that > can then be passed on to the module manager. That way, each flag corresponds > to just one module library and modules from that library. > Another approach is to create a new flag (e.g., --modules-dir) that contains > a path to a directory that would contain multiple json files. One can think > of it as analogous to systemd units. The operator drops a new file > into this directory and the file is automatically picked up by the > master/agent module manager. Further, a naming scheme can be adopted that > prefixes the filename with "NN_" to signify load order. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
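Under the proposed directory scheme, each vendor would drop one manifest per library, following the JSON layout the existing {{--modules}} flag already accepts. A hypothetical {{10_example.json}} (library path, module name, and parameters are all made-up values for illustration):

```json
{
  "libraries": [
    {
      "file": "/usr/lib/mesos/modules/libexample_module.so",
      "modules": [
        {
          "name": "org_example_ExampleIsolator",
          "parameters": [
            { "key": "verbosity", "value": "info" }
          ]
        }
      ]
    }
  ]
}
```

The "10_" prefix would control load order relative to other manifests in the directory, mirroring how systemd unit drop-ins are ordered by filename.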
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276521#comment-15276521 ] Chris commented on MESOS-5342: -- For information about submodular functions (and why it was selected for this problem), strongly suggest reviewing at least this youtube lecture/video (ideally the entire series of videos) publicly available from MLSS Iceland 2014: https://youtu.be/6ThMzlHdKsI > CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance, specifically in terms of cache locality. > Applications requiring GPU resources can benefit from this feature by getting > access to cores closest to the GPU hardware, which reduces cpu-gpu copy > latency. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology, and the advent of Intel's Knights-series processors, will require > making choices about where containers are going to run on the mesos-agent's > processor cores - this feature is a step toward developing a robust > solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. 
> For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Comment: was deleted (was: For information about submodular functions (and why it was selected for this problem), strongly suggest reviewing at least this youtube lecture/video (ideally the entire series of videos) publicly available from MLSS Iceland 2014: https://youtu.be/6ThMzlHdKsI) > CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance, specifically in terms of cache locality. > Applications requiring GPU resources can benefit from this feature by getting > access to cores closest to the GPU hardware, which reduces cpu-gpu copy > latency. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology, and the advent of Intel's Knights-series processors, will require > making choices about where containers are going to run on the mesos-agent's > processor cores - this feature is a step toward developing a robust > solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. 
> For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276518#comment-15276518 ] Chris commented on MESOS-5342: -- For information about submodular functions (and why it was selected for this problem), strongly suggest reviewing at least this youtube lecture/video (ideally the entire series of videos) publicly available from MLSS Iceland 2014: https://youtu.be/6ThMzlHdKsI
[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Comment: was deleted (was: For information about submodular functions (and why it was selected for this problem), strongly suggest reviewing this youtube video: https://youtu.be/6ThMzlHdKsI)
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276514#comment-15276514 ] Chris commented on MESOS-5342: -- For information about submodular functions (and why it was selected for this problem), strongly suggest reviewing this youtube video: https://youtu.be/6ThMzlHdKsI
[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Comment: was deleted (was: Fixed a small bug in the greedy submodular subset selection algorithm. The "submodular cost" of selecting a core was being used in the knapsack budget test (cores currently have an at-most-budget-cost of "1.0"). The correct cost is now being used in the test.)
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276500#comment-15276500 ] Chris commented on MESOS-5342: -- Fixed a small bug in the greedy submodular subset selection algorithm. The "submodular cost" of selecting a core was being used in the knapsack budget test (cores currently have an at-most-budget-cost of "1.0"). The correct cost is now being used in the test.
[jira] [Commented] (MESOS-5168) Benchmark overhead of authorization based filtering.
[ https://issues.apache.org/jira/browse/MESOS-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276477#comment-15276477 ] Joerg Schad commented on MESOS-5168: and the current prototype used to benchmark here: https://github.com/joerg84/mesos/tree/filterPrototype > Benchmark overhead of authorization based filtering. > > > Key: MESOS-5168 > URL: https://issues.apache.org/jira/browse/MESOS-5168 > Project: Mesos > Issue Type: Improvement >Reporter: Joerg Schad >Assignee: Joerg Schad > Labels: authorization, mesosphere, security > Fix For: 0.29.0 > > > When adding authorization based filtering as outlined in MESOS-4931 we need > to be careful, especially for performance critical endpoints such as /state. > We should ensure via a benchmark that performance does not degrade below an > acceptable level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5168) Benchmark overhead of authorization based filtering.
[ https://issues.apache.org/jira/browse/MESOS-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276475#comment-15276475 ] Joerg Schad commented on MESOS-5168: Benchmark results can be found here: https://docs.google.com/document/d/1Ojq55I_2iyYWMSxnq9TvshVn9JGwEX4PfRBkwgZTiuY
[jira] [Updated] (MESOS-5346) Some endpoints do not specify their allowed request methods.
[ https://issues.apache.org/jira/browse/MESOS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-5346: Priority: Minor (was: Major) > Some endpoints do not specify their allowed request methods. > > > Key: MESOS-5346 > URL: https://issues.apache.org/jira/browse/MESOS-5346 > Project: Mesos > Issue Type: Bug > Components: security, technical debt >Reporter: Jan Schlicht >Priority: Minor > Labels: http, security, tech-debt > > Some HTTP endpoints (for example "/flags" or "/state") create a response > regardless of what the request method is. For example an HTTP POST to the > "/state" endpoint will create the same response as an HTTP GET. > While this inconsistency isn't harmful at the moment, it will get problematic > when authorization is implemented, using separate ACLs for endpoints that can > be GETed and endpoints that can be POSTed to. > Validation of the request method should be added to all endpoints, e.g. > "/state" should return a 405 (Method Not Allowed) when POSTed to. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
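The fix the ticket asks for amounts to checking the request method before dispatching to the endpoint handler. A minimal illustrative sketch (Mesos itself is C++; the endpoint table and function names below are hypothetical, not Mesos code):

```python
# Hypothetical per-endpoint method validation, illustrating the 405
# (Method Not Allowed) behavior requested in MESOS-5346.
ALLOWED_METHODS = {
    "/state": {"GET"},      # read-only endpoints accept only GET
    "/flags": {"GET"},
    "/teardown": {"POST"},  # mutating endpoints accept only POST
}

def validate(method, endpoint):
    """Return an HTTP status code: 200 if the method is allowed for the
    endpoint, 405 if the endpoint exists but the method is not permitted,
    404 for unknown endpoints."""
    allowed = ALLOWED_METHODS.get(endpoint)
    if allowed is None:
        return 404
    if method not in allowed:
        return 405  # Method Not Allowed
    return 200
```

With this check in place, a POST to "/state" yields 405 instead of silently producing the same response as a GET.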
[jira] [Updated] (MESOS-5340) libevent builds may prevent new connections
[ https://issues.apache.org/jira/browse/MESOS-5340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-5340: -- Description: When using an SSL-enabled build of Mesos in combination with SSL-downgrading support, any connection that does not actually transmit data will hang the runnable (e.g. master). To reproduce the issue (on any platform)... Spin up a master with SSL-downgrading enabled: {noformat} $ export SSL_ENABLED=true $ export SSL_SUPPORT_DOWNGRADE=true $ export SSL_KEY_FILE=/path/to/your/foo.key $ export SSL_CERT_FILE=/path/to/your/foo.crt $ export SSL_CA_FILE=/path/to/your/ca.crt $ ./bin/mesos-master.sh --work_dir=/tmp/foo {noformat} Create some artificial HTTP request load for quickly spotting the problem in both the master logs and the output of curl itself: {noformat} $ while true; do sleep 0.1; echo $( date +">%H:%M:%S.%3N"; curl -s -k -A "SSL Debug" http://localhost:5050/master/slaves; echo ;date +"<%H:%M:%S.%3N"; echo); done {noformat} Now create a connection to the master that does not transmit any data: {noformat} $ telnet localhost 5050 {noformat} You should now see the curl requests hanging; the master stops responding to new connections. This persists until some data is transmitted via the above telnet connection or it is closed. This problem was initially observed when running Mesos on an AWS cluster with a load balancer enabled for the master node (which uses an idle, persistent connection). Such a connection naturally does not transmit any data as long as there are no external requests routed via the load balancer. AWS allows setting up a timeout for those connections; in our test environment this duration was set to 60 seconds, and hence we saw our master repeatedly becoming unresponsive for 60 seconds, then getting "unstuck" for a brief period until it got stuck again. 
was: When using an SSL-enabled build of Mesos in combination with SSL-downgrading support, any connection that does not actually transmit data will hang the runnable (e.g. master). For reproducing the issue (on any platform)... Spin up a master with enabled SSL-downgrading: {noformat} $ export SSL_ENABLED=true $ export SSL_SUPPORT_DOWNGRADE=true $ export SSL_KEY_FILE=/path/to/your/foo.key $ export SSL_CERT_FILE=/path/to/your/foo.crt $ export SSL_CA_FILE=/path/to/your/ca.crt $ ./bin/mesos-master.sh --work_dir=/tmp/foo {noformat} Create some artificial HTTP request load for quickly spotting the problem in both, the master logs as well as the output of CURL itself: {noformat} $ while true; do sleep 0.1; echo $( date +">%H:%M:%S.%3N"; curl -s -k -A "SSL Debug" http://localhost:5050/master/slaves; echo ;date +"<%H:%M:%S.%3N"; echo); done {noformat} Now create a connection to the master that does not transmit any data: {noformat} $ telnet localhost 5050 {noformat} You should now see the CURL requests hanging, the master stops responding to new connections. This will persist until either some data is transmitted via the above telnet connection or it is closed. This problem has initially been observed when running Mesos on an AWS cluster with enabled internal ELB health-checks for the master node. Those health-checks are using long-lasting connections that do not transmit any data and are closed after a configurable duration. In our test environment, this duration was set to 60 seconds and hence we were seeing our master getting repetitively unresponsive for 60 seconds, then getting "unstuck" for a brief period until it got stuck again. 
> libevent builds may prevent new connections > --- > > Key: MESOS-5340 > URL: https://issues.apache.org/jira/browse/MESOS-5340 > Project: Mesos > Issue Type: Bug > Components: security >Affects Versions: 0.29.0, 0.28.1 >Reporter: Till Toenshoff >Priority: Blocker > Labels: mesosphere, security, ssl
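For scripted reproduction, the idle telnet connection above can be replaced by a few lines of Python that open a TCP connection and deliberately send nothing, mimicking an idle load-balancer health-check connection (host and port match the repro above; this helper is an illustrative sketch, not part of Mesos):

```python
# Open a connection to the master and transmit no data. With an
# SSL-downgrading build, the master blocks while peeking at the socket
# to decide between SSL and plaintext, stalling other new connections.
import socket
import time

def hold_idle_connection(host="localhost", port=5050, seconds=30):
    sock = socket.create_connection((host, port))
    try:
        time.sleep(seconds)  # send nothing for the whole duration
    finally:
        sock.close()
```

Running `hold_idle_connection()` against a master started as shown above should make the parallel curl loop hang until the function returns.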
[jira] [Commented] (MESOS-5345) Design doc for TASK_GONE
[ https://issues.apache.org/jira/browse/MESOS-5345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276305#comment-15276305 ] Neil Conway commented on MESOS-5345: Initial design doc: https://docs.google.com/document/d/1AWpb-tXb53FEaPSzRAdaAS3aTBQ9b_wAZx1L0pQ_s-s > Design doc for TASK_GONE > > > Key: MESOS-5345 > URL: https://issues.apache.org/jira/browse/MESOS-5345 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Neil Conway > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5345) Design doc for TASK_GONE
Neil Conway created MESOS-5345: -- Summary: Design doc for TASK_GONE Key: MESOS-5345 URL: https://issues.apache.org/jira/browse/MESOS-5345 Project: Mesos Issue Type: Task Components: master Reporter: Neil Conway -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5344) Introduce TASK_GONE task status
Neil Conway created MESOS-5344: -- Summary: Introduce TASK_GONE task status Key: MESOS-5344 URL: https://issues.apache.org/jira/browse/MESOS-5344 Project: Mesos Issue Type: Epic Components: master Reporter: Neil Conway The TASK_LOST task status describes two different situations: (a) the task was not launched because of an error (e.g., insufficient available resources), or (b) the master lost contact with a running task (e.g., due to a network partition); the master will kill the task when it can (e.g., when the network partition heals), but in the meantime the task may still be running. This has two problems: 1. Using the same task status for two fairly different situations is confusing. 2. In the partitioned-but-still-running case, frameworks have no easy way to determine when a task has truly terminated. To address these problems, we propose introducing a new task status, TASK_GONE, which would be used whenever a task can be guaranteed to not be running. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5343) Behavior of custom HTTP authenticators with disabled HTTP authentication is inconsistent between master and agent
Benjamin Bannier created MESOS-5343: --- Summary: Behavior of custom HTTP authenticators with disabled HTTP authentication is inconsistent between master and agent Key: MESOS-5343 URL: https://issues.apache.org/jira/browse/MESOS-5343 Project: Mesos Issue Type: Bug Reporter: Benjamin Bannier When setting a custom authenticator with {{http_authenticators}} and also specifying {{authenticate_http=false}}, agents currently refuse to start with {code} A custom HTTP authenticator was specified with the '--http_authenticators' flag, but HTTP authentication was not enabled via '--authenticate_http' {code} Masters, on the other hand, accept this setting. Having differing behavior between master and agents is confusing; we should decide whether we want to accept these settings or not, and make the implementations consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Description: The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance, specifically in terms of cache locality. Applications requiring GPU resources can benefit from this feature by getting access to cores closest to the GPU hardware, which reduces cpu-gpu copy latency. Most cluster management systems from the HPC community (SLURM) provide both cgroup isolation and cpu binding. This feature would provide similar capabilities. The current interest in supporting Intel's Cache Allocation Technology, and the advent of Intel's Knights-series processors, will require making choices about where containers are going to run on the mesos-agent's processor cores - this feature is a step toward developing a robust solution. The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency. was: The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance, specifically in terms of cache locality. Applications requiring GPU resources can benefit from this feature by getting access to cores closest to the GPU hardware, which reduces cpu-gpu copy latency. 
Most cluster management systems from the HPC community (SLURM) provide both cgroup isolation and cpu binding. This feature would provide similar capabilities. The current interest in supporting Intel's Cache Allocation Technology will require making choices about where containers are going to run on the mesos-agent's processor(s) - this feature is a step toward developing a robust solution. The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency.
[jira] [Updated] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Description: The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance, specifically in terms of cache locality. Applications requiring GPU resources can benefit from this feature by getting access to cores closest to the GPU hardware, which reduces cpu-gpu copy latency. Most cluster management systems from the HPC community (SLURM) provide both cgroup isolation and cpu binding. This feature would provide similar capabilities. The current interest in supporting Intel's Cache Allocation Technology will require making choices about where containers are going to run on the mesos-agent's processor(s) - this feature is a step toward developing a robust solution. The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency. was: The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance, particularly in the case of applications requiring GPU resources. Most cluster management systems from the HPC community (SLURM) provide both cgroup isolation and cpu binding. This feature would provide similar capabilities. 
The current interest in supporting Intel's Cache Allocation Technology will require making choices about where container's are going to run on the mesos-agent's processor(s) - this feature is a step toward developing a robust solution. The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency.
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276195#comment-15276195 ] Chris commented on MESOS-5342: -- The implementation has been posted to review board and successfully passed 'make check'. It requires the hwloc library to perform machine hardware topology discovery and cpu binding; updates were made to configure.ac and Makefile.am. The implementation is a new "device" under the cgroups isolator directory called "hwloc". It detects topology and computes the total number of cores required by the container (also checking whether the container requires a GPU). If the container requires a GPU, the topology information is used to find the "closest" cores based on latency. If the container only requires cpu, a histogram of task assignment to cores is checked. If the histogram is "empty" (all cores have a value of 1.0), then a random core is selected, the latency matrix is used to find cores that are "closest" to the random core, and the histogram is updated. If the histogram is "not empty", then a greedy submodular subset selection algorithm is used to select N cores using the latency matrix and a "per-core" cost value. The "per-core" cost value is a normalized version of the histogram divided by the number of processing units available on each core. Greedy submodular subset selection algorithms use a "diminishing returns" property to find an optimal subset of items under a knapsack constraint. When the list of cores is returned, a bit vector representing a cpuset is bound to the container's pid_t. When the container is cleaned up, the histogram is updated by reducing the current task counts on each core assigned to the pid_t by 1.0. 
> CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance, particularly in the case of applications > requiring GPU resources. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology will require making choices about where containers are going to > run on the mesos-agent's processor(s) - this feature is a step toward > developing a robust solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. > For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Comment: was deleted (was: implementation has been posted to review board. successfully passed 'make check'. requires use of the hwloc library to perform machine hardware topology discovery and cpu binding. updates were made to configure.ac and Makefile.am. implementation is a new "device" under the cgroups isolator directory called "hwloc". Implementation detects topology, computes total number of cores required by the container (also checks if the container requires gpu). if the container requires gpu, the topology information is used to find the "closest" cores based on latency. If the container only requires cpu, a histogram of task assignment to cores is checked. If the histogram is "empty" (all cores have a value of 1.0) then a random core is selected and the latency matrix is used to find cores that are "closest" to the random core. The histogram is updated. If the histogram is "not empty" then a greedy submodular subset selection algorithm is used to select N cores using the latency matrix and a "per-core" cost value. The "per-core" cost value is a normalized version of the histogram divided by the number of processing units available on each core. Greedy submodular subset selection algorithms use a "diminishing returns property" to find an optimal subset of items under a knapsack constraint. When the list of cores is returned, a bit vector representing a cpuset is bound to the container's pid_t. When the container is cleaned up, the histogram is updated by reducing the current task counts on each core assigned to the pid_t by -1.0.) 
[jira] [Updated] (MESOS-5292) Website references non-existing file NewbieContributionOverview.jpg
[ https://issues.apache.org/jira/browse/MESOS-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joerg Schad updated MESOS-5292: --- Attachment: Screen Shot 2016-05-09 at 12.07.09.png Broken rendered page. > Website references non-existing file NewbieContributionOverview.jpg > --- > > Key: MESOS-5292 > URL: https://issues.apache.org/jira/browse/MESOS-5292 > Project: Mesos > Issue Type: Bug > Components: project website >Reporter: Benjamin Bannier > Labels: site > Attachments: Screen Shot 2016-05-09 at 12.07.09.png > > > The website references the non-existing file > {{NewbieContributionOverview.jpg}} in {{docs/newbie-guide.md}}. Looking at > the commit adding this documentation it appears this file was never added to > the repository. We should either provide the file or rewrite the section to > work without needing to reference it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Description: The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance, particularly in the case of applications requiring GPU resources. Most cluster management systems from the HPC community (SLURM) provide both cgroup isolation and cpu binding. This feature would provide similar capabilities. The current interest in supporting Intel's Cache Allocation Technology will require making choices about where container's are going to run on the mesos-agent's processor(s) - this feature is a step toward developing a robust solution. The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency. was: The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance; particularly in the case of applications requiring GPU resources. Most cluster management systems from the HPC community (SLURM) provide both cgroup isolation and cpu binding. This feature would provide similar capabilities. 
The current interest in supporting Intel's Cache Allocation Technology will require making choices about where container's are going to run on the mesos-agent's processor(s) - this feature is a step toward developing a robust solution. The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency. > CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance, particularly in the case of applications > requiring GPU resources. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology will require making choices about where container's are going to > run on the mesos-agent's processor(s) - this feature is a step toward > developing a robust solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. 
> For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
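The GPU path described in this ticket (prefer the cores with the lowest latency to the GPU, then bind the chosen set to the container as a cpuset bit vector, per the review-board comments) could be sketched as below. Purely illustrative: `gpu_latency`, `closest_cores_to_gpu`, and `cpuset_mask` are hypothetical names, not part of the actual patch.

```python
def closest_cores_to_gpu(gpu_latency, n):
    """Return the n core indices with the lowest latency to the GPU.

    gpu_latency[i] is an assumed latency measure between the GPU and
    core i, e.g. derived from an hwloc-style distance matrix.
    """
    ranked = sorted(range(len(gpu_latency)), key=lambda c: gpu_latency[c])
    return ranked[:n]

def cpuset_mask(cores):
    """Pack the selected core indices into a cpuset-style bit vector."""
    mask = 0
    for c in cores:
        mask |= 1 << c
    return mask
```

In the real isolator the resulting mask would be applied to the container's pid_t via the cgroup cpuset controller; here it is just an integer for illustration.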
[jira] [Commented] (MESOS-5292) Website references non-existing file NewbieContributionOverview.jpg
[ https://issues.apache.org/jira/browse/MESOS-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276172#comment-15276172 ] Joerg Schad commented on MESOS-5292: I created https://reviews.apache.org/r/47115/ to remove the reference to the image for now (feel free to re-add it after the image has been added). The review also fixes some other style issues with the stricter website markdown renderer (e.g., broken links). When making changes to the documentation, please check it with the website docker generator (https://github.com/apache/mesos/blob/master/support/site-docker/README.md). > Website references non-existing file NewbieContributionOverview.jpg > --- > > Key: MESOS-5292 > URL: https://issues.apache.org/jira/browse/MESOS-5292 > Project: Mesos > Issue Type: Bug > Components: project website >Reporter: Benjamin Bannier > Labels: site > > The website references the non-existing file > {{NewbieContributionOverview.jpg}} in {{docs/newbie-guide.md}}. Looking at > the commit adding this documentation it appears this file was never added to > the repository. We should either provide the file or rewrite the section to > work without needing to reference it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5340) libevent builds may prevent new connections
[ https://issues.apache.org/jira/browse/MESOS-5340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-5340: -- Summary: libevent builds may prevent new connections (was: SSL-downgrading support may prevent new connections) > libevent builds may prevent new connections > --- > > Key: MESOS-5340 > URL: https://issues.apache.org/jira/browse/MESOS-5340 > Project: Mesos > Issue Type: Bug > Components: security >Affects Versions: 0.29.0, 0.28.1 >Reporter: Till Toenshoff >Priority: Blocker > Labels: mesosphere, security, ssl > > When using an SSL-enabled build of Mesos in combination with SSL-downgrading > support, any connection that does not actually transmit data will hang the > runnable (e.g. master). > To reproduce the issue (on any platform)... > Spin up a master with SSL-downgrading enabled: > {noformat} > $ export SSL_ENABLED=true > $ export SSL_SUPPORT_DOWNGRADE=true > $ export SSL_KEY_FILE=/path/to/your/foo.key > $ export SSL_CERT_FILE=/path/to/your/foo.crt > $ export SSL_CA_FILE=/path/to/your/ca.crt > $ ./bin/mesos-master.sh --work_dir=/tmp/foo > {noformat} > Create some artificial HTTP request load for quickly spotting the problem in > both the master logs and the output of curl itself: > {noformat} > $ while true; do sleep 0.1; echo $( date +">%H:%M:%S.%3N"; curl -s -k -A "SSL > Debug" http://localhost:5050/master/slaves; echo ;date +"<%H:%M:%S.%3N"; > echo); done > {noformat} > Now create a connection to the master that does not transmit any data: > {noformat} > $ telnet localhost 5050 > {noformat} > You should now see the curl requests hang; the master stops responding to > new connections. This persists until either some data is transmitted via > the above telnet connection or it is closed. > This problem was initially observed when running Mesos on an AWS cluster > with internal ELB health-checks enabled for the master node.
Those > health-checks use long-lasting connections that do not transmit any > data and are closed after a configurable duration. In our test environment, > this duration was set to 60 seconds, and hence we saw our master > repeatedly become unresponsive for 60 seconds, get "unstuck" for > a brief period, and then get stuck again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
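The reproduction above hinges on a connection that completes the TCP handshake but never sends application data (the telnet step, or an ELB health check). A minimal Python sketch of such an idle connection; the function name and defaults are assumptions, and 5050 is simply the default Mesos master port:

```python
import socket

def idle_connection(host, port, timeout=60.0):
    """Open a TCP connection and deliberately send nothing.

    This mimics `telnet localhost 5050` (or an ELB health check): the
    socket completes the TCP handshake but never transmits data.
    Illustrative only; not part of any Mesos test suite.
    """
    sock = socket.create_connection((host, port), timeout=timeout)
    return sock  # caller keeps it open; an affected master stalls here
```

Holding the returned socket open against an SSL-downgrading libevent build of the master should reproduce the hang described above; closing it (or sending any byte) unblocks the master.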
[jira] [Commented] (MESOS-3235) FetcherCacheHttpTest.HttpCachedSerialized and FetcherCacheHttpTest.HttpCachedConcurrent are flaky
[ https://issues.apache.org/jira/browse/MESOS-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276131#comment-15276131 ] haosdent commented on MESOS-3235: - Got it. {{FetcherCacheHttpTest.HttpCachedConcurrent}} always finishes within 1s on my machine, but it takes much longer (more than 15 seconds) in my slow VM. According to the log, it spends most of its time in subprocess calls, e.g. launching {{mesos-fetcher}} and {{mesos-executor}}. I have not yet investigated why {{subprocess}} is so slow. Let me get more details about this and rework the patch. Thanks a lot for your comments! > FetcherCacheHttpTest.HttpCachedSerialized and > FetcherCacheHttpTest.HttpCachedConcurrent are flaky > - > > Key: MESOS-3235 > URL: https://issues.apache.org/jira/browse/MESOS-3235 > Project: Mesos > Issue Type: Bug > Components: fetcher, tests >Affects Versions: 0.23.0 >Reporter: Joseph Wu >Assignee: haosdent > Labels: mesosphere > Fix For: 0.27.0 > > Attachments: fetchercache_log_centos_6.txt > > > On OSX, {{make clean && make -j8 V=0 check}}: > {code} > [--] 3 tests from FetcherCacheHttpTest > [ RUN ] FetcherCacheHttpTest.HttpCachedSerialized > HTTP/1.1 200 OK > Date: Fri, 07 Aug 2015 17:23:05 GMT > Content-Length: 30 > I0807 10:23:05.673596 2085372672 exec.cpp:133] Version: 0.24.0 > E0807 10:23:05.675884 184373248 socket.hpp:173] Shutdown failed on fd=18: > Socket is not connected [57] > I0807 10:23:05.675897 182226944 exec.cpp:207] Executor registered on slave > 20150807-102305-139395082-52338-52313-S0 > E0807 10:23:05.683980 184373248 socket.hpp:173] Shutdown failed on fd=18: > Socket is not connected [57] > Registered executor on 10.0.79.8 > Starting task 0 > Forked command at 54363 > sh -c './mesos-fetcher-test-cmd 0' > E0807 10:23:05.694953 184373248 socket.hpp:173] Shutdown failed on fd=18: > Socket is not connected [57] > Command exited with status 0 (pid: 54363) > E0807 10:23:05.793927 184373248 socket.hpp:173] Shutdown failed on fd=18: > Socket is not connected
[57] > I0807 10:23:06.590008 2085372672 exec.cpp:133] Version: 0.24.0 > E0807 10:23:06.592244 355938304 socket.hpp:173] Shutdown failed on fd=18: > Socket is not connected [57] > I0807 10:23:06.592243 353255424 exec.cpp:207] Executor registered on slave > 20150807-102305-139395082-52338-52313-S0 > E0807 10:23:06.597995 355938304 socket.hpp:173] Shutdown failed on fd=18: > Socket is not connected [57] > Registered executor on 10.0.79.8 > Starting task 1 > Forked command at 54411 > sh -c './mesos-fetcher-test-cmd 1' > E0807 10:23:06.608708 355938304 socket.hpp:173] Shutdown failed on fd=18: > Socket is not connected [57] > Command exited with status 0 (pid: 54411) > E0807 10:23:06.707649 355938304 socket.hpp:173] Shutdown failed on fd=18: > Socket is not connected [57] > ../../src/tests/fetcher_cache_tests.cpp:860: Failure > Failed to wait 15secs for awaitFinished(task.get()) > *** Aborted at 1438968214 (unix time) try "date -d @1438968214" if you are > using GNU date *** > [ FAILED ] FetcherCacheHttpTest.HttpCachedSerialized (28685 ms) > [ RUN ] FetcherCacheHttpTest.HttpCachedConcurrent > PC: @0x113723618 process::Owned<>::get() > *** SIGSEGV (@0x0) received by PID 52313 (TID 0x118d59000) stack trace: *** > @ 0x7fff8fcacf1a _sigtramp > @ 0x7f9bc3109710 (unknown) > @0x1136f07e2 mesos::internal::slave::Fetcher::fetch() > @0x113862f9d > mesos::internal::slave::MesosContainerizerProcess::fetch() > @0x1138f1b5d > _ZZN7process8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKNS2_11ContainerIDERKNS2_11CommandInfoERKNSt3__112basic_stringIcNSC_11char_traitsIcEENSC_9allocatorIcRK6OptionISI_ERKNS2_7SlaveIDES6_S9_SI_SM_SP_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSW_FSU_T1_T2_T3_T4_T5_ET6_T7_T8_T9_T10_ENKUlPNS_11ProcessBaseEE_clES1D_ > @0x1138f18cf > 
_ZNSt3__110__function6__funcIZN7process8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKNS5_11ContainerIDERKNS5_11CommandInfoERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcRK6OptionISK_ERKNS5_7SlaveIDES9_SC_SK_SO_SR_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSY_FSW_T1_T2_T3_T4_T5_ET6_T7_T8_T9_T10_EUlPNS2_11ProcessBaseEE_NSI_IS1G_EEFvS1F_EEclEOS1F_ > @0x1143768cf std::__1::function<>::operator()() > @0x11435ca7f process::ProcessBase::visit() > @0x1143ed6fe process::DispatchEvent::visit() > @0x11271 process::ProcessBase::serve() > @0x114343b4e process::ProcessManager::resume() > @0x1143431ca process::internal::schedule() > @0x1143da646 _ZNSt3__114__thread_proxyINS_5tupleIJPFvvEEPvS5_ > @ 0x7fff95090268
[jira] [Commented] (MESOS-3235) FetcherCacheHttpTest.HttpCachedSerialized and FetcherCacheHttpTest.HttpCachedConcurrent are flaky
[ https://issues.apache.org/jira/browse/MESOS-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276116#comment-15276116 ] Bernd Mathiske commented on MESOS-3235: --- It seems doubtful that lengthening the wait time for task completion solves much, since successful runs are way shorter than the default 15 seconds, typically in the low single digit second range. Machines aren't that slow, are they? And we get these failures on machines that are known to be fast occasionally as well. I suspect something else is wrong here. What I have seen in failure logs is that one task somehow has not produced status updates all the way up to the AWAIT statement in question - although it must have reached the contention barrier which asserts that all tasks have been launched as the fetcher has been observed downloading every script. So one guess is that something is blocking/eating/delaying status updates at some stage - occasionally. In all the cases I have seen the tasks are not launched in serial order. And that's exactly why I wrote this test! So we can see if we are dealing with concurrency correctly. Too bad we don't know what's failing yet. If we had a way to reproduce this behavior more often, we could switch on more logging and just repeat the test often enough to find something. But repeating the test tends to make the problem go away. Ideas? 