[jira] [Commented] (MESOS-7714) Fix agent downgrade for reservation refinement

2017-07-28 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105962#comment-16105962
 ] 

Michael Park commented on MESOS-7714:
-

Ah, yes, and this ticket is for (2). Seems like we're on the same page now?

> Fix agent downgrade for reservation refinement
> --
>
> Key: MESOS-7714
> URL: https://issues.apache.org/jira/browse/MESOS-7714
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>Priority: Blocker
>
> The agent code only partially supports downgrading of an agent correctly.
> The checkpointed resources are done correctly, but the resources within
> the {{SlaveInfo}} message as well as tasks and executors also need to be 
> downgraded
> correctly and converted back on recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7714) Fix agent downgrade for reservation refinement

2017-07-28 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105960#comment-16105960
 ] 

Yan Xu commented on MESOS-7714:
---

I mean when we are not using the new features, so this appears to be (2). I didn't 
know the details until I just read the design doc and saw what you mentioned 
about the agent: "On disk (checkpointing), it will also generally use the new 
Resources format, except for resources with a single dynamic reservation it 
will continue to checkpoint in the old Resource format."

> Fix agent downgrade for reservation refinement
> --
>
> Key: MESOS-7714
> URL: https://issues.apache.org/jira/browse/MESOS-7714
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>Priority: Blocker
>
> The agent code only partially supports downgrading of an agent correctly.
> The checkpointed resources are done correctly, but the resources within
> the {{SlaveInfo}} message as well as tasks and executors also need to be 
> downgraded
> correctly and converted back on recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7714) Fix agent downgrade for reservation refinement

2017-07-28 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105893#comment-16105893
 ] 

Michael Park commented on MESOS-7714:
-

In order to support (1) I think we'd have to checkpoint resources with refined 
reservations in a different location.
You're saying you wouldn't want to upgrade to 1.4 because you can't downgrade 
once people start using new features?
Just for comparison, we have the same limitations for multi-role support. That 
is, once you upgrade to 1.3 and frameworks start using multi-role, you can't 
downgrade.

> Fix agent downgrade for reservation refinement
> --
>
> Key: MESOS-7714
> URL: https://issues.apache.org/jira/browse/MESOS-7714
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>Priority: Blocker
>
> The agent code only partially supports downgrading of an agent correctly.
> The checkpointed resources are done correctly, but the resources within
> the {{SlaveInfo}} message as well as tasks and executors also need to be 
> downgraded
> correctly and converted back on recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7215) Race condition on re-registration of non-partition-aware frameworks

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7215:

Target Version/s: 1.2.2, 1.4.0, 1.3.2  (was: 1.2.2, 1.4.0)

> Race condition on re-registration of non-partition-aware frameworks
> ---
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Yan Xu
>Assignee: Megha Sharma
>Priority: Critical
>
> Prior to the partition-awareness work MESOS-5344, upon agent reregistration 
> after it has been removed, the master only sends ShutdownFrameworkMessages to 
> the agent for frameworks that it knows have been torn down. 
> With the new logic in MESOS-5344, Mesos is now sending 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered)
> This is problematic. The offer from this agent can still go to the same 
> framework which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is not shutting down of course, so from the 
> master and the scheduler's perspective the task is pending in STAGING forever 
> until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent is assuming the framework to be going away (and act accordingly) when 
> it's not. 
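
To illustrate the failure mode described above: a rough, self-contained sketch
(hypothetical types and names, not the actual Mesos agent code) of an agent
that marks a framework terminating after a ShutdownFrameworkMessage and then
silently drops later task launches for it, which is why the task appears stuck
in STAGING to the master and the scheduler.

{code}
#include <iostream>
#include <set>
#include <string>

struct Agent {
  std::set<std::string> terminatingFrameworks;

  // ShutdownFrameworkMessage received: remember the framework as terminating.
  void shutdownFramework(const std::string& frameworkId) {
    terminatingFrameworks.insert(frameworkId);
  }

  // RunTaskMessage received: tasks of a "terminating" framework are dropped
  // without any status update being sent back.
  void runTask(const std::string& frameworkId, const std::string& taskId) {
    if (terminatingFrameworks.count(frameworkId) > 0) {
      std::cout << "Ignoring task " << taskId
                << " of terminating framework " << frameworkId << std::endl;
      return;
    }
    std::cout << "Launching task " << taskId << std::endl;
  }
};

int main() {
  Agent agent;
  agent.shutdownFramework("fw-1");  // non-partition-aware, still registered
  agent.runTask("fw-1", "task-1");  // never launched; stays STAGING upstream
  return 0;
}
{code}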



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7652) Docker image with universal containerizer does not work if WORKDIR is missing in the rootfs.

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7652:

Target Version/s: 1.2.2, 1.4.0, 1.3.2  (was: 1.2.2, 1.4.0)

> Docker image with universal containerizer does not work if WORKDIR is missing 
> in the rootfs.
> 
>
> Key: MESOS-7652
> URL: https://issues.apache.org/jira/browse/MESOS-7652
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.2.1
>Reporter: michael beisiegel
>Assignee: Gilbert Song
>Priority: Critical
>  Labels: mesosphere
>
> hello,
> used the following docker image recently
> quay.io/spinnaker/front50:master
> https://quay.io/repository/spinnaker/front50
> Here the link to the Dockerfile
> https://github.com/spinnaker/front50/blob/master/Dockerfile
> and here the source
> {color:blue}FROM java:8
> MAINTAINER delivery-engineer...@netflix.com
> COPY . workdir/
> WORKDIR workdir
> RUN GRADLE_USER_HOME=cache ./gradlew buildDeb -x test && \
>   dpkg -i ./front50-web/build/distributions/*.deb && \
>   cd .. && \
>   rm -rf workdir
> CMD ["/opt/front50/bin/front50"]{color}
> The image works fine with the docker containerizer, but the universal 
> containerizer shows the following in stderr.
> "Failed to chdir into current working directory '/workdir': No such file or 
> directory"
> The problem comes from the fact that the Dockerfile creates a workdir but 
> then later removes the created dir as part of a RUN. The docker containerizer 
> has no problem with it if you do
> docker run -ti --rm quay.io/spinnaker/front50:master bash
> you get into the working dir, but the universal containerizer fails with the 
> error.
> thanks for your help,
> Michael



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7652) Docker image with universal containerizer does not work if WORKDIR is missing in the rootfs.

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7652:

Target Version/s: 1.2.2, 1.4.0  (was: 1.2.2, 1.3.1, 1.4.0)

> Docker image with universal containerizer does not work if WORKDIR is missing 
> in the rootfs.
> 
>
> Key: MESOS-7652
> URL: https://issues.apache.org/jira/browse/MESOS-7652
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.2.1
>Reporter: michael beisiegel
>Assignee: Gilbert Song
>Priority: Critical
>  Labels: mesosphere
>
> hello,
> used the following docker image recently
> quay.io/spinnaker/front50:master
> https://quay.io/repository/spinnaker/front50
> Here the link to the Dockerfile
> https://github.com/spinnaker/front50/blob/master/Dockerfile
> and here the source
> {color:blue}FROM java:8
> MAINTAINER delivery-engineer...@netflix.com
> COPY . workdir/
> WORKDIR workdir
> RUN GRADLE_USER_HOME=cache ./gradlew buildDeb -x test && \
>   dpkg -i ./front50-web/build/distributions/*.deb && \
>   cd .. && \
>   rm -rf workdir
> CMD ["/opt/front50/bin/front50"]{color}
> The image works fine with the docker containerizer, but the universal 
> containerizer shows the following in stderr.
> "Failed to chdir into current working directory '/workdir': No such file or 
> directory"
> The problem comes from the fact that the Dockerfile creates a workdir but 
> then later removes the created dir as part of a RUN. The docker containerizer 
> has no problem with it if you do
> docker run -ti --rm quay.io/spinnaker/front50:master bash
> you get into the working dir, but the universal containerizer fails with the 
> error.
> thanks for your help,
> Michael



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7622) Agent can crash if a HTTP executor tries to retry subscription in running state.

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7622:

Target Version/s: 1.2.2, 1.3.2  (was: 1.2.2)

> Agent can crash if a HTTP executor tries to retry subscription in running 
> state.
> 
>
> Key: MESOS-7622
> URL: https://issues.apache.org/jira/browse/MESOS-7622
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Reporter: Aaron Wood
>Assignee: Anand Mazumdar
>Priority: Blocker
>
> It is possible that a running executor might retry its subscribe request. 
> This can lead to a crash if it previously had any launched tasks. Note that 
> the executor would still be able to subscribe again when the agent process 
> restarts and is recovering.
> {code}
> sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave 
> --isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime
>  --image_providers=docker --image_provisioner_backend=overlay 
> --containerizers=mesos --launcher_dir=$(pwd) 
> --executor_environment_variables='{"LD_LIBRARY_PATH": 
> "/home/aaron/Code/src/mesos/build/src/.libs"}'
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by 
> aaron
> I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0
> I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected
> I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state
> I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice 
> `mesos_executors.slice`
> I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver
> I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: 
> cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret
> I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: 1: hadoop: not found
> I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin 
> 'hadoop' as it could not be created: Failed to create HDFS client: Failed to 
> execute 'hadoop version 2>&1'; the command was either not found or exited 
> with a non-zero exit status: 127
> I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend 
> 'overlay'
> I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on 
> (1)@127.0.1.1:5051
> I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
> --docker="docker" --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}"
>  --executor_registration_timeout="1mins" 
> --executor_reregistration_timeout="2secs" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" --hostname_lookup="true" 
> --http_command_executor="false" --http_heartbeat_interval="30secs" 
> --image_providers="docker" --image_provisioner_backend="overlay" 
> --initialize_driver_logging="true" 
> --isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime"
>  --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" 
> --logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" 
> --max_completed_executors_per_framework="150" 
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
> --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
> 

[jira] [Updated] (MESOS-7374) Running DOCKER images in Mesos Container Runtime without `linux/filesystem` isolation enabled renders host unusable

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7374:

Target Version/s: 1.4.0, 1.3.2  (was: 1.4.0)

> Running DOCKER images in Mesos Container Runtime without `linux/filesystem` 
> isolation enabled renders host unusable
> ---
>
> Key: MESOS-7374
> URL: https://issues.apache.org/jira/browse/MESOS-7374
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.2.0
>Reporter: Tim Harper
>Assignee: Chun-Hung Hsiao
>Priority: Blocker
>  Labels: containerizer, mesosphere
>
> If I run the pod below (using Marathon 1.4.2) against a mesos agent that has 
> the flags (also below), then the overlay filesystem replaces the system root 
> mount, effectively rendering the host unusable until reboot.
> flags:
> - {{--containerizers mesos,docker}}
> - {{--image_providers APPC,DOCKER}}
> - {{--isolation cgroups/cpu,cgroups/mem,docker/runtime}}
> pod definition for Marathon:
> {code:java}
> {
>   "id": "/simplepod",
>   "scaling": { "kind": "fixed", "instances": 1 },
>   "containers": [
> {
>   "name": "sleep1",
>   "exec": { "command": { "shell": "sleep 1000" } },
>   "resources": { "cpus": 0.1, "mem": 32 },
>   "image": {
> "id": "alpine",
> "kind": "DOCKER"
>   }
> }
>   ],
>   "networks": [ {"mode": "host"} ]
> }
> {code}
> Mesos should probably check for this and avoid replacing the system root 
> mount point at startup or launch time.
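
A minimal sketch of the startup check suggested above, under the assumption of
hypothetical flag handling (this is not the actual Mesos agent flag
validation): refuse to start when container images can be provisioned but
'filesystem/linux' is missing from --isolation, rather than later replacing
the host root mount.

{code}
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split a comma-separated flag value into its entries.
std::vector<std::string> split(const std::string& s, char delim) {
  std::vector<std::string> out;
  std::stringstream ss(s);
  std::string item;
  while (std::getline(ss, item, delim)) {
    out.push_back(item);
  }
  return out;
}

int main() {
  // Values as they might be parsed from --image_providers and --isolation.
  const std::string imageProviders = "APPC,DOCKER";
  const std::string isolation = "cgroups/cpu,cgroups/mem,docker/runtime";

  bool hasFilesystemLinux = false;
  for (const std::string& isolator : split(isolation, ',')) {
    if (isolator == "filesystem/linux") {
      hasFilesystemLinux = true;
    }
  }

  if (!imageProviders.empty() && !hasFilesystemLinux) {
    std::cerr << "Error: --image_providers requires the 'filesystem/linux' "
              << "isolator to be listed in --isolation" << std::endl;
    return 1;
  }

  return 0;
}
{code}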



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7374) Running DOCKER images in Mesos Container Runtime without `linux/filesystem` isolation enabled renders host unusable

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7374:

Target Version/s: 1.4.0,   (was: 1.3.1, 1.4.0)

> Running DOCKER images in Mesos Container Runtime without `linux/filesystem` 
> isolation enabled renders host unusable
> ---
>
> Key: MESOS-7374
> URL: https://issues.apache.org/jira/browse/MESOS-7374
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.2.0
>Reporter: Tim Harper
>Assignee: Chun-Hung Hsiao
>Priority: Blocker
>  Labels: containerizer, mesosphere
>
> If I run the pod below (using Marathon 1.4.2) against a mesos agent that has 
> the flags (also below), then the overlay filesystem replaces the system root 
> mount, effectively rendering the host unusable until reboot.
> flags:
> - {{--containerizers mesos,docker}}
> - {{--image_providers APPC,DOCKER}}
> - {{--isolation cgroups/cpu,cgroups/mem,docker/runtime}}
> pod definition for Marathon:
> {code:java}
> {
>   "id": "/simplepod",
>   "scaling": { "kind": "fixed", "instances": 1 },
>   "containers": [
> {
>   "name": "sleep1",
>   "exec": { "command": { "shell": "sleep 1000" } },
>   "resources": { "cpus": 0.1, "mem": 32 },
>   "image": {
> "id": "alpine",
> "kind": "DOCKER"
>   }
> }
>   ],
>   "networks": [ {"mode": "host"} ]
> }
> {code}
> Mesos should probably check for this and avoid replacing the system root 
> mount point at startup or launch time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7622) Agent can crash if a HTTP executor tries to retry subscription in running state.

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7622:

Target Version/s: 1.2.2  (was: 1.2.2, 1.3.1)

> Agent can crash if a HTTP executor tries to retry subscription in running 
> state.
> 
>
> Key: MESOS-7622
> URL: https://issues.apache.org/jira/browse/MESOS-7622
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Reporter: Aaron Wood
>Assignee: Anand Mazumdar
>Priority: Blocker
>
> It is possible that a running executor might retry its subscribe request. 
> This can lead to a crash if it previously had any launched tasks. Note that 
> the executor would still be able to subscribe again when the agent process 
> restarts and is recovering.
> {code}
> sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave 
> --isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime
>  --image_providers=docker --image_provisioner_backend=overlay 
> --containerizers=mesos --launcher_dir=$(pwd) 
> --executor_environment_variables='{"LD_LIBRARY_PATH": 
> "/home/aaron/Code/src/mesos/build/src/.libs"}'
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by 
> aaron
> I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0
> I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected
> I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state
> I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice 
> `mesos_executors.slice`
> I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver
> I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: 
> cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret
> I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: 1: hadoop: not found
> I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin 
> 'hadoop' as it could not be created: Failed to create HDFS client: Failed to 
> execute 'hadoop version 2>&1'; the command was either not found or exited 
> with a non-zero exit status: 127
> I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend 
> 'overlay'
> I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on 
> (1)@127.0.1.1:5051
> I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
> --docker="docker" --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}"
>  --executor_registration_timeout="1mins" 
> --executor_reregistration_timeout="2secs" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" --hostname_lookup="true" 
> --http_command_executor="false" --http_heartbeat_interval="30secs" 
> --image_providers="docker" --image_provisioner_backend="overlay" 
> --initialize_driver_logging="true" 
> --isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime"
>  --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" 
> --logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" 
> --max_completed_executors_per_framework="150" 
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
> --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
> 

[jira] [Commented] (MESOS-7714) Fix agent downgrade for reservation refinement

2017-07-28 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105874#comment-16105874
 ] 

Yan Xu commented on MESOS-7714:
---

I see, but (1) is a real operational concern for upgrading to 1.4, right? I 
wouldn't want to upgrade my agents to 1.4 knowing I won't be able to roll them 
back once refined reservations are made (i.e., after they are used for a 
while)...

I think to support (1) we have to support the 'pre-reservation-refinement' 
format for a while (across 1.x versions?).

https://github.com/apache/mesos/blob/master/docs/versioning.md#upgrades 
mentions upgrades but not downgrades, but I don't see how it would work if 
downgrades are not implicitly covered by the same guarantee...

Thoughts?

> Fix agent downgrade for reservation refinement
> --
>
> Key: MESOS-7714
> URL: https://issues.apache.org/jira/browse/MESOS-7714
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>Priority: Blocker
>
> The agent code only partially supports downgrading of an agent correctly.
> The checkpointed resources are done correctly, but the resources within
> the {{SlaveInfo}} message as well as tasks and executors also need to be 
> downgraded
> correctly and converted back on recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7215) Race condition on re-registration of non-partition-aware frameworks

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7215:

Fix Version/s: (was: 1.3.0)

> Race condition on re-registration of non-partition-aware frameworks
> ---
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Yan Xu
>Assignee: Megha Sharma
>Priority: Critical
>
> Prior to the partition-awareness work MESOS-5344, upon agent reregistration 
> after it has been removed, the master only sends ShutdownFrameworkMessages to 
> the agent for frameworks that it knows have been torn down. 
> With the new logic in MESOS-5344, Mesos is now sending 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered)
> This is problematic. The offer from this agent can still go to the same 
> framework which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is not shutting down of course, so from the 
> master and the scheduler's perspective the task is pending in STAGING forever 
> until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent is assuming the framework to be going away (and act accordingly) when 
> it's not. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7215) Race condition on re-registration of non-partition-aware frameworks

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7215:

Target Version/s: 1.2.2, 1.4.0  (was: 1.2.2, 1.3.0)

> Race condition on re-registration of non-partition-aware frameworks
> ---
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Yan Xu
>Assignee: Megha Sharma
>Priority: Critical
>
> Prior to the partition-awareness work MESOS-5344, upon agent reregistration 
> after it has been removed, the master only sends ShutdownFrameworkMessages to 
> the agent for frameworks that it knows have been torn down. 
> With the new logic in MESOS-5344, Mesos is now sending 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered)
> This is problematic. The offer from this agent can still go to the same 
> framework which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is not shutting down of course, so from the 
> master and the scheduler's perspective the task is pending in STAGING forever 
> until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent is assuming the framework to be going away (and act accordingly) when 
> it's not. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents.

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7389:

Target Version/s: 1.3.0, 1.2.1, 1.4.0  (was: 1.2.1, 1.3.1, 1.4.0)

> Mesos 1.2.0 crashes with pre-1.0 Mesos agents.
> --
>
> Key: MESOS-7389
> URL: https://issues.apache.org/jira/browse/MESOS-7389
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: Ubuntu 14.04 
>Reporter: Nicholas Studt
>Assignee: Neil Conway
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.2.1, 1.3.0, 1.4.0
>
>
> During upgrade from 1.0.1 to 1.2.0 a single mesos-slave reregistering with 
> the running leader caused the leader to terminate. All 3 of the masters 
> suffered the same failure as the same slave node reregistered against the new 
> leader; this continued across the entire cluster until the offending slave 
> node was removed and fixed. The fix to the slave node was to remove the mesos 
> directory and then start the slave node back up. 
>  F0412 17:24:42.736600  6317 master.cpp:5701] Check failed: 
> frameworks_.contains(task.framework_id())
>  *** Check failure stack trace: ***
>  @ 0x7f59f944f94d  google::LogMessage::Fail()
>  @ 0x7f59f945177d  google::LogMessage::SendToLog()
>  @ 0x7f59f944f53c  google::LogMessage::Flush()
>  @ 0x7f59f9452079  google::LogMessageFatal::~LogMessageFatal()
>  I0412 17:24:42.750300  6316 replica.cpp:693] Replica received learned notice 
> for position 6896 from @0.0.0.0:0 
>  @ 0x7f59f88f2341  mesos::internal::master::Master::_reregisterSlave()
>  @ 0x7f59f88f488f  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>  @ 0x7f59f93c3eb1  process::ProcessManager::resume()
>  @ 0x7f59f93ccd57  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>  @ 0x7f59f77cfa60  (unknown)
>  @ 0x7f59f6fec184  start_thread
>  @ 0x7f59f6d19bed  (unknown)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7215) Race condition on re-registration of non-partition-aware frameworks

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7215:

Fix Version/s: 1.3.0

> Race condition on re-registration of non-partition-aware frameworks
> ---
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Yan Xu
>Assignee: Megha Sharma
>Priority: Critical
> Fix For: 1.3.0
>
>
> Prior to the partition-awareness work MESOS-5344, upon agent reregistration 
> after it has been removed, the master only sends ShutdownFrameworkMessages to 
> the agent for frameworks that it knows have been torn down. 
> With the new logic in MESOS-5344, Mesos is now sending 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered)
> This is problematic. The offer from this agent can still go to the same 
> framework which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is not shutting down of course, so from the 
> master and the scheduler's perspective the task is pending in STAGING forever 
> until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent is assuming the framework to be going away (and act accordingly) when 
> it's not. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7215) Race condition on re-registration of non-partition-aware frameworks

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7215:

Target Version/s: 1.3.0, 1.2.2, 1.4.0  (was: 1.2.2, 1.3.1, 1.4.0)

> Race condition on re-registration of non-partition-aware frameworks
> ---
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Yan Xu
>Assignee: Megha Sharma
>Priority: Critical
> Fix For: 1.3.0
>
>
> Prior to the partition-awareness work MESOS-5344, upon agent reregistration 
> after it has been removed, the master only sends ShutdownFrameworkMessages to 
> the agent for frameworks that it knows have been torn down. 
> With the new logic in MESOS-5344, Mesos is now sending 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered)
> This is problematic. The offer from this agent can still go to the same 
> framework which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is not shutting down of course, so from the 
> master and the scheduler's perspective the task is pending in STAGING forever 
> until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent is assuming the framework to be going away (and act accordingly) when 
> it's not. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7215) Race condition on re-registration of non-partition-aware frameworks

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7215:

Target Version/s: 1.3.0, 1.2.2  (was: 1.2.2, 1.3.0, 1.4.0)

> Race condition on re-registration of non-partition-aware frameworks
> ---
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Yan Xu
>Assignee: Megha Sharma
>Priority: Critical
> Fix For: 1.3.0
>
>
> Prior to the partition-awareness work MESOS-5344, upon agent reregistration 
> after it has been removed, the master only sends ShutdownFrameworkMessages to 
> the agent for frameworks that it knows have been torn down. 
> With the new logic in MESOS-5344, Mesos is now sending 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered)
> This is problematic. The offer from this agent can still go to the same 
> framework which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is not shutting down of course, so from the 
> master and the scheduler's perspective the task is pending in STAGING forever 
> until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent is assuming the framework to be going away (and act accordingly) when 
> it's not. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7252) Need to fix resource check in long-lived framework

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7252:

Fix Version/s: 1.3.1
   1.2.2

> Need to fix resource check in long-lived framework
> --
>
> Key: MESOS-7252
> URL: https://issues.apache.org/jira/browse/MESOS-7252
> Project: Mesos
>  Issue Type: Bug
>  Components: framework
>Reporter: Avinash Sridharan
>Assignee: Michael Park
>  Labels: Mesosphere
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> The multi-role changes in Mesos changed the implementation of 
> `Resources::contains`.
> This results in the search for a given resource to be performed only for 
> unallocated resources.
> For allocated resources the search is actually performed only for a given 
> role. 
> Due to this change the resource check in the long-lived framework is 
> failing, leading to the framework not launching any tasks. 
> The fix would be to unallocate all resources in a given offer and then do the 
> `contains` check.
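
A sketch of that fix, assuming {{Resources::unallocate()}} and
{{Resources::contains()}} behave as described above; this is a fragment meant
to sit inside the example framework code, not a definitive implementation.

{code}
#include <mesos/mesos.hpp>
#include <mesos/resources.hpp>

using mesos::Offer;
using mesos::Resources;

bool offerSatisfies(const Offer& offer, const Resources& taskResources)
{
  // Copy the offered resources and strip their AllocationInfo so that the
  // role-agnostic 'contains' check works the way it did before the
  // multi-role changes.
  Resources offered(offer.resources());
  offered.unallocate();

  return offered.contains(taskResources);
}
{code}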



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7252) Need to fix resource check in long-lived framework

2017-07-28 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7252:

Target Version/s: 1.2.2, 1.3.1, 1.4.0  (was: 1.4.0)

> Need to fix resource check in long-lived framework
> --
>
> Key: MESOS-7252
> URL: https://issues.apache.org/jira/browse/MESOS-7252
> Project: Mesos
>  Issue Type: Bug
>  Components: framework
>Reporter: Avinash Sridharan
>Assignee: Michael Park
>  Labels: Mesosphere
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> The multi-role changes in Mesos changed the implementation of 
> `Resources::contains`.
> This results in the search for a given resource to be performed only for 
> unallocated resources.
> For allocated resources the search is actually performed only for a given 
> role. 
> Due to this change the resource check in the long-lived framework is 
> failing, leading to the framework not launching any tasks. 
> The fix would be to unallocate all resources in a given offer and then do the 
> `contains` check.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7714) Fix agent downgrade for reservation refinement

2017-07-28 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105864#comment-16105864
 ] 

Michael Park commented on MESOS-7714:
-

Ah okay. First, the support is for downgrading a 1.4 agent to a <= 1.3.x agent as 
long as no refined reservations have been made yet.
The way we achieve this is to "downgrade" all the resources that get 
checkpointed into the "pre-reservation-refinement" format,
as long as none of them carry refined reservations. That {{CHECK}} would fail 
either because (1) refined reservations have been made on the 1.4 agent, or 
(2) there are resources that we didn't checkpoint in the 
"pre-reservation-refinement" format when we should have. The goal of this 
ticket is to fix (2).
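
A hypothetical sketch of the checkpoint-time decision described above; the
types and helper names are illustrative stand-ins, not the actual Mesos
functions.

{code}
#include <vector>

struct Resource
{
  int reservations = 0;  // stand-in for the stacked reservations on a resource
};

// A resource is "refined" if it carries more than one stacked reservation.
bool hasRefinedReservations(const std::vector<Resource>& resources)
{
  for (const Resource& resource : resources) {
    if (resource.reservations > 1) {
      return true;
    }
  }
  return false;
}

void checkpoint(const std::vector<Resource>& resources)
{
  if (!hasRefinedReservations(resources)) {
    // Downgrade to the pre-reservation-refinement format so a <= 1.3.x
    // agent can still recover from this checkpoint.
    // writePreRefinementFormat(resources);
  } else {
    // Refined reservations are in use; only a 1.4+ agent can recover.
    // writePostRefinementFormat(resources);
  }
}
{code}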

> Fix agent downgrade for reservation refinement
> --
>
> Key: MESOS-7714
> URL: https://issues.apache.org/jira/browse/MESOS-7714
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>Priority: Blocker
>
> The agent code only partially supports downgrading of an agent correctly.
> The checkpointed resources are done correctly, but the resources within
> the {{SlaveInfo}} message as well as tasks and executors also need to be 
> downgraded
> correctly and converted back on recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-5187) The filesystem/linux isolator does not set the permissions of the host_path.

2017-07-28 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-5187:

Summary: The filesystem/linux isolator does not set the permissions of the 
host_path.  (was: filesystem/linux isolator does not set the permissions of the 
host_path)

> The filesystem/linux isolator does not set the permissions of the host_path.
> 
>
> Key: MESOS-5187
> URL: https://issues.apache.org/jira/browse/MESOS-5187
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.26.0
> Environment: Mesos 0.26.0, Apache Aurora 0.12
>Reporter: Stephan Erb
>Assignee: Gilbert Song
>  Labels: mesosphere, volumes
>
> The {{filesystem/linux}} isolator is not a drop-in replacement for the 
> {{filesystem/shared}} isolator. This should be considered before the latter 
> is deprecated.
> We are currently using the {{filesystem/shared}} isolator together with the 
> following slave option. This provides us with a private {{/tmp}} and 
> {{/var/tmp}} folder for each task.
> {code}
> --default_container_info='{
> "type": "MESOS",
> "volumes": [
> {"host_path": "system/tmp", "container_path": "/tmp", 
>"mode": "RW"},
> {"host_path": "system/vartmp",  "container_path": "/var/tmp", 
>"mode": "RW"}
> ]
> }'
> {code}
> When browsing the Mesos sandbox, one can see the following permissions:
> {code}
> mode  nlink   uid gid sizemtime   
> drwxrwxrwx3   rootroot4 KBApr 11 18:16 tmp
> drwxrwxrwx2   rootroot4 KBApr 11 18:15 vartmp 
> {code}
> However, when running with the new {{filesystem/linux}} isolator, the 
> permissions are different:
> {code}
> mode  nlink   uid gid sizemtime   
> drwxr-xr-x 2  rootroot4 KBApr 12 10:34 tmp
> drwxr-xr-x 2  rootroot4 KBApr 12 10:34 vartmp
> {code}
> This prevents user code (running as a non-root user) from writing to those 
> folders, i.e. every write attempt fails with permission denied. 
> *Context*:
> * We are using Apache Aurora. Aurora is running its custom executor as root 
> but then switches to a non-privileged user before running the actual user 
> code. 
> * The following code seems to have enabled our use case in the existing 
> {{filesystem/shared}} isolator: 
> https://github.com/apache/mesos/blob/4d2b1b793e07a9c90b984ca330a3d7bc9e1404cc/src/slave/containerizer/mesos/isolators/filesystem/shared.cpp#L175-L198
>  
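
A hedged sketch of the kind of permission fix-up the {{filesystem/shared}}
isolator performed for such host_path volumes (see the linked shared.cpp for
the authoritative logic; the paths and names here are illustrative only).

{code}
#include <sys/stat.h>
#include <sys/types.h>

#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

// Create the per-container host_path under the sandbox and give it the
// rwxrwxrwx mode the reporter saw with filesystem/shared, so non-root task
// code can write into it. (A real implementation would create intermediate
// directories and handle ownership as well.)
bool prepareHostPath(const std::string& sandbox, const std::string& hostPath)
{
  const std::string path = sandbox + "/" + hostPath;  // e.g. ".../system/tmp"

  if (::mkdir(path.c_str(), 0755) != 0 && errno != EEXIST) {
    std::cerr << "mkdir failed: " << std::strerror(errno) << std::endl;
    return false;
  }

  if (::chmod(path.c_str(), 0777) != 0) {
    std::cerr << "chmod failed: " << std::strerror(errno) << std::endl;
    return false;
  }

  return true;
}
{code}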



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-2092) Make ACLs dynamic

2017-07-28 Thread Sai Teja Ranuva (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105548#comment-16105548
 ] 

Sai Teja Ranuva commented on MESOS-2092:


I would like to work on this feature if no one else is working on it already.
I have been using Mesos for half a year, but haven't contributed to it yet. 
I am interested in this issue as it has a newbie tag and I have run into it due
to a use case I am working on.

> Make ACLs dynamic
> -
>
> Key: MESOS-2092
> URL: https://issues.apache.org/jira/browse/MESOS-2092
> Project: Mesos
>  Issue Type: Task
>  Components: security
>Reporter: Alexander Rukletsov
>Assignee: Yongqiao Wang
>  Labels: mesosphere, newbie
>
> The master loads ACLs once during its launch and there is no way to update 
> them in a running master. Making them dynamic will allow updating ACLs on the 
> fly, for example granting a new framework the necessary rights.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6846) Support `teardown` in the v1 operator API.

2017-07-28 Thread Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105397#comment-16105397
 ] 

Quinn commented on MESOS-6846:
--

https://reviews.apache.org/r/61222/

> Support `teardown` in the v1 operator API.
> --
>
> Key: MESOS-6846
> URL: https://issues.apache.org/jira/browse/MESOS-6846
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Joerg Schad
>Assignee: Quinn
>  Labels: mesosphere
>
> Currently, the v1 operator API does not support teardown of frameworks.
> The semantics should be similar to the old HTTP endpoint: 
> http://mesos.apache.org/documentation/latest/endpoints/master/teardown/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-6846) Support `teardown` in the v1 operator API.

2017-07-28 Thread Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quinn reassigned MESOS-6846:


Assignee: Quinn

> Support `teardown` in the v1 operator API.
> --
>
> Key: MESOS-6846
> URL: https://issues.apache.org/jira/browse/MESOS-6846
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Joerg Schad
>Assignee: Quinn
>  Labels: mesosphere
>
> Currently, the v1 operator API does not support teardown of frameworks.
> The semantics should be similar to the old HTTP endpoint: 
> http://mesos.apache.org/documentation/latest/endpoints/master/teardown/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7785) Pass Operator API subscription events through authorizer

2017-07-28 Thread Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105340#comment-16105340
 ] 

Quinn commented on MESOS-7785:
--

https://reviews.apache.org/r/61189/

> Pass Operator API subscription events through authorizer 
> -
>
> Key: MESOS-7785
> URL: https://issues.apache.org/jira/browse/MESOS-7785
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Mathew Appelman
>Assignee: Quinn
>
> In order to consume the subscription endpoint from the Operator API in the 
> DC/OS UI, we must ensure a user can only receive events they are authorized 
> to consume.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-6082) Add scheduler Call and Event based metrics to the master.

2017-07-28 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6082:
---
Labels: mesosphere metrics tech-debt  (was: mesosphere)

> Add scheduler Call and Event based metrics to the master.
> -
>
> Key: MESOS-6082
> URL: https://issues.apache.org/jira/browse/MESOS-6082
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Benjamin Mahler
>Assignee: Abhishek Dasgupta
>Priority: Critical
>  Labels: mesosphere, metrics, tech-debt
>
> Currently, the master only has metrics for the old-style messages and these 
> are re-used for calls unfortunately:
> {code}
>   // Messages from schedulers.
>   process::metrics::Counter messages_register_framework;
>   process::metrics::Counter messages_reregister_framework;
>   process::metrics::Counter messages_unregister_framework;
>   process::metrics::Counter messages_deactivate_framework;
>   process::metrics::Counter messages_kill_task;
>   process::metrics::Counter messages_status_update_acknowledgement;
>   process::metrics::Counter messages_resource_request;
>   process::metrics::Counter messages_launch_tasks;
>   process::metrics::Counter messages_decline_offers;
>   process::metrics::Counter messages_revive_offers;
>   process::metrics::Counter messages_suppress_offers;
>   process::metrics::Counter messages_reconcile_tasks;
>   process::metrics::Counter messages_framework_to_executor;
> {code}
> Now that we've introduced the Call/Event based API, we should have metrics 
> that reflect this. For example:
> {code}
> {
>   scheduler/calls: 100
>   scheduler/calls/decline: 90,
>   scheduler/calls/accept: 10,
>   scheduler/calls/accept/operations/create: 1,
>   scheduler/calls/accept/operations/destroy: 0,
>   scheduler/calls/accept/operations/launch: 4,
>   scheduler/calls/accept/operations/launch_group: 2,
>   scheduler/calls/accept/operations/reserve: 1,
>   scheduler/calls/accept/operations/unreserve: 0,
>   scheduler/calls/kill: 0,
>   // etc
> }
> {code}
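
A sketch of how such per-call counters could be registered, assuming
{{process::metrics::Counter}} and {{process::metrics::add()}} as used by the
existing master metrics; the struct and metric names simply mirror the example
above and are not the actual implementation.

{code}
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

struct SchedulerCallMetrics
{
  process::metrics::Counter calls{"scheduler/calls"};
  process::metrics::Counter calls_decline{"scheduler/calls/decline"};
  process::metrics::Counter calls_accept{"scheduler/calls/accept"};
  process::metrics::Counter calls_kill{"scheduler/calls/kill"};

  SchedulerCallMetrics()
  {
    process::metrics::add(calls);
    process::metrics::add(calls_decline);
    process::metrics::add(calls_accept);
    process::metrics::add(calls_kill);
  }

  ~SchedulerCallMetrics()
  {
    process::metrics::remove(calls);
    process::metrics::remove(calls_decline);
    process::metrics::remove(calls_accept);
    process::metrics::remove(calls_kill);
  }
};

// On every received scheduler call, something like:
//   ++metrics.calls;
//   if (call.type() == scheduler::Call::DECLINE) { ++metrics.calls_decline; }
{code}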



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7814) Improve the code style of the test frameworks

2017-07-28 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-7814:
--
Target Version/s: 1.4.0  (was: 1.2.2, 1.4.0)

> Improve the code style of the test frameworks
> -
>
> Key: MESOS-7814
> URL: https://issues.apache.org/jira/browse/MESOS-7814
> Project: Mesos
>  Issue Type: Improvement
>  Components: framework
>Reporter: Armand Grillet
>Assignee: Armand Grillet
>Priority: Minor
>  Labels: mesosphere, newbie
>
> These improvements include three main points:
> * Adding a {{name}} flag to certain frameworks to distinguish between 
> instances.
> * Cleaning up the code style of the frameworks.
> * For frameworks with custom executors, such as the balloon framework, adding an 
> {{executor_extra_uris}} flag containing URIs that will be passed to the 
> {{command_info}} of the executor.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7823) Reorganize the new Mesos CLI to live under src/python

2017-07-28 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-7823:
--
Sprint: Mesosphere Sprint 60

> Reorganize the new Mesos CLI to live under src/python
> -
>
> Key: MESOS-7823
> URL: https://issues.apache.org/jira/browse/MESOS-7823
> Project: Mesos
>  Issue Type: Improvement
>  Components: cli
>Reporter: Kevin Klues
>Assignee: Armand Grillet
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7840) Add Mesos CLI command to list active tasks

2017-07-28 Thread Armand Grillet (JIRA)
Armand Grillet created MESOS-7840:
-

 Summary: Add Mesos CLI command to list active tasks
 Key: MESOS-7840
 URL: https://issues.apache.org/jira/browse/MESOS-7840
 Project: Mesos
  Issue Type: Improvement
  Components: cli
Reporter: Armand Grillet
Assignee: Armand Grillet


We need to add a command to list all the tasks running in a Mesos cluster by 
checking the endpoint {{/tasks}} and reporting the results.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7284) Allow Mesos CLI to take a master IP

2017-07-28 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-7284:
--
Summary: Allow Mesos CLI to take a master IP  (was: Allow Mesos CLI to take 
an agent IP)

> Allow Mesos CLI to take a master IP
> ---
>
> Key: MESOS-7284
> URL: https://issues.apache.org/jira/browse/MESOS-7284
> Project: Mesos
>  Issue Type: Task
>Reporter: Avinash Sridharan
>Assignee: Armand Grillet
>
> Allow the Mesos CLI to take an agent IP. This will allow the CLI to remotely 
> connect to an agent and run commands on that agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7284) Allow Mesos CLI to take a master IP

2017-07-28 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-7284:
--
Description: Allow the Mesos CLI to take a master IP. This will allow the 
CLI to send HTTP requests to that master.  (was: Allow the Mesos CLI to take an 
agent IP. This will allow the CLI to remotely connect to an agent and run 
commands on that agent.)

> Allow Mesos CLI to take a master IP
> ---
>
> Key: MESOS-7284
> URL: https://issues.apache.org/jira/browse/MESOS-7284
> Project: Mesos
>  Issue Type: Task
>Reporter: Avinash Sridharan
>Assignee: Armand Grillet
>
> Allow the Mesos CLI to take a master IP. This will allow the CLI to send HTTP 
> requests to that master.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7284) Allow Mesos CLI to take an agent IP

2017-07-28 Thread Armand Grillet (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104911#comment-16104911
 ] 

Armand Grillet commented on MESOS-7284:
---

After discussing with [~klueska], we now wish to use the master to get info 
about the cluster. The previously referenced ticket has been updated 
accordingly.

> Allow Mesos CLI to take an agent IP
> ---
>
> Key: MESOS-7284
> URL: https://issues.apache.org/jira/browse/MESOS-7284
> Project: Mesos
>  Issue Type: Task
>Reporter: Avinash Sridharan
>Assignee: Armand Grillet
>
> Allow the Mesos CLI to take an agent IP. This will allow the CLI to remotely 
> connect to an agent and run commands on that agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7643) The order of isolators provided in '--isolation' flag is not preserved and instead sorted alphabetically

2017-07-28 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7643:
---
Target Version/s: 1.4.0  (was: 1.2.2, 1.3.1, 1.4.0, 1.1.3)

> The order of isolators provided in '--isolation' flag is not preserved and 
> instead sorted alphabetically
> 
>
> Key: MESOS-7643
> URL: https://issues.apache.org/jira/browse/MESOS-7643
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.1.2, 1.2.0, 1.3.0
>Reporter: Michael Cherny
>Assignee: Gilbert Song
>Priority: Critical
>  Labels: isolation
>
> According to the documentation and comments in the code, the order of the entries in 
> the --isolation flag should specify the ordering of the isolators. 
> Specifically, the
> `create` and `prepare` calls for each isolator should run serially in the 
> order in which they appear in the --isolation flag, while the `cleanup` call 
> should be serialized in reverse order (with the exception of the filesystem 
> isolator, which is always first).
> But in fact, the isolators provided in the '--isolation' flag are sorted 
> alphabetically.
> That happens in [this line of 
> code|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L377],
> which uses a 'set' (apparently instead of a list or 
> vector), and 'set' is a sorted container.
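
A hedged sketch (not the actual patch) of preserving the flag order while
still dropping duplicates: keep a vector for the ordering and use a set only
to filter out entries that were already seen.

{code}
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> parseIsolation(const std::string& flag)
{
  std::vector<std::string> ordered;
  std::set<std::string> seen;

  std::stringstream ss(flag);
  std::string token;
  while (std::getline(ss, token, ',')) {
    if (seen.insert(token).second) {  // keep only the first occurrence
      ordered.push_back(token);
    }
  }

  return ordered;
}

int main()
{
  for (const std::string& isolator :
       parseIsolation("filesystem/linux,docker/runtime,cgroups/cpu")) {
    std::cout << isolator << std::endl;  // printed in flag order, not sorted
  }

  return 0;
}
{code}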



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7349) Document Mesos "check" feature.

2017-07-28 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7349:
---
Target Version/s: 1.4.0
Priority: Blocker  (was: Major)

> Document Mesos "check" feature.
> ---
>
> Key: MESOS-7349
> URL: https://issues.apache.org/jira/browse/MESOS-7349
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Blocker
>  Labels: documentation, mesosphere
>
> This should include recommendations for framework authors about how and when to 
> use general checks, as well as a comparison with health checks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7652) Docker image with universal containerizer does not work if WORKDIR is missing in the rootfs.

2017-07-28 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-7652:

Priority: Critical  (was: Minor)

> Docker image with universal containerizer does not work if WORKDIR is missing 
> in the rootfs.
> 
>
> Key: MESOS-7652
> URL: https://issues.apache.org/jira/browse/MESOS-7652
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.2.1
>Reporter: michael beisiegel
>Assignee: Gilbert Song
>Priority: Critical
>  Labels: mesosphere
>
> hello,
> used the following docker image recently
> quay.io/spinnaker/front50:master
> https://quay.io/repository/spinnaker/front50
> Here the link to the Dockerfile
> https://github.com/spinnaker/front50/blob/master/Dockerfile
> and here the source
> {color:blue}FROM java:8
> MAINTAINER delivery-engineer...@netflix.com
> COPY . workdir/
> WORKDIR workdir
> RUN GRADLE_USER_HOME=cache ./gradlew buildDeb -x test && \
>   dpkg -i ./front50-web/build/distributions/*.deb && \
>   cd .. && \
>   rm -rf workdir
> CMD ["/opt/front50/bin/front50"]{color}
> The image works fine with the docker containerizer, but the universal 
> containerizer shows the following in stderr.
> "Failed to chdir into current working directory '/workdir': No such file or 
> directory"
> The problem comes from the fact that the Dockerfile creates a workdir but 
> then later removes the created dir as part of a RUN. The docker containerizer 
> has no problem with it if you do
> docker run -ti --rm quay.io/spinnaker/front50:master bash
> you get into the working dir, but the universal containerizer fails with the 
> error.
> thanks for your help,
> Michael



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7492) Introduce a daemon manager in the agent.

2017-07-28 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104557#comment-16104557
 ] 

James DeFelice commented on MESOS-7492:
---

Instead of health checks/auto-restart I'd actually like to see a way to adjust 
the "kill" signal that an agent will send to a daemon in order to shut it down. 
Especially if we want to support containerizing the various supervision systems 
that already exist in the wild (s6, systemd, etc).

> Introduce a daemon manager in the agent.
> 
>
> Key: MESOS-7492
> URL: https://issues.apache.org/jira/browse/MESOS-7492
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Joseph Wu
>  Labels: mesosphere, storage
>
> Once we have standalone container support from the containerizer, we should 
> consider adding a daemon manager inside the agent. It'll be like 'monit', 
> 'upstart' or 'systemd', but with very limited functionality. For instance, 
> as a start, the manager will simply always restart a daemon if it 
> fails. It'll also try to clean up unknown daemons.
> This feature will be used to manage CSI plugin containers on the agent.
> The daemon manager should have an interface allowing operators to "register" 
> a daemon with a name and a config. The daemon manager is 
> responsible for restarting the daemon if it crashes until someone explicitly 
> "unregisters" it. Some simple backoff and health check functionality should be 
> provided.
> We probably need a small design doc for this.
> {code}
> message DaemonConfig {
>   optional ContainerInfo container;
>   optional CommandInfo command;
>   optional uint32 poll_interval;
>   optional uint32 initial_delay;
>   optional CheckInfo check; // For health check.
> }
> class DaemonManager
> {
> public:
>   Future register(
> const ContainerID& containerId,
> const DaemonConfig& config);
>   Future unregister(const ContainerID& containerId);
>   Future ps();
>   Future status(const ContainerID& containerId);
> };
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7492) Introduce a daemon manager in the agent.

2017-07-28 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104551#comment-16104551
 ] 

James DeFelice commented on MESOS-7492:
---

Could we start with an even more minimal design and (a) get rid of the health 
check fields (poll_interval, initial_delay, and check), and (b) eliminate the 
auto-restart feature? We can add these later if/when needed as requirements and 
user stories evolve.

There's lots of supervision tooling to choose from already and it's not clear 
to me that Mesos should spend the time reinventing this wheel right now. Also, 
supporting run-once daemon tasks actually supports **both** run-once and 
run-forever models (run-forever tasks just need **some** supervisor process 
above the actual service -- that supervisor doesn't need to be Mesos).

> Introduce a daemon manager in the agent.
> 
>
> Key: MESOS-7492
> URL: https://issues.apache.org/jira/browse/MESOS-7492
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Joseph Wu
>  Labels: mesosphere, storage
>
> Once we have standalone container support from the containerizer, we should 
> consider adding a daemon manager inside the agent. It'll be like 'monit', 
> 'upstart' or 'systemd', but with very limited functionality. For instance, 
> as a start, the manager will simply always restart a daemon if it 
> fails. It'll also try to clean up unknown daemons.
> This feature will be used to manage CSI plugin containers on the agent.
> The daemon manager should have an interface allowing operators to "register" 
> a daemon with a name and a config. The daemon manager is 
> responsible for restarting the daemon if it crashes until someone explicitly 
> "unregisters" it. Some simple backoff and health check functionality should be 
> provided.
> We probably need a small design doc for this.
> {code}
> message DaemonConfig {
>   optional ContainerInfo container;
>   optional CommandInfo command;
>   optional uint32 poll_interval;
>   optional uint32 initial_delay;
>   optional CheckInfo check; // For health check.
> }
> class DaemonManager
> {
> public:
>   Future register(
> const ContainerID& containerId,
> const DaemonConfig& config);
>   Future unregister(const ContainerID& containerId);
>   Future ps();
>   Future status(const ContainerID& containerId);
> };
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)