[jira] [Commented] (MESOS-8125) Agent should properly handle recovering an executor when its pid is reused

2018-01-09 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319818#comment-16319818
 ] 

Yan Xu commented on MESOS-8125:
---

We used to not need to handle recovering executors after a reboot because the 
agent would have been considered lost, so not only did we not need to recover 
the executors, we also didn't need to resume unacknowledged status updates, etc.

In the new scenario we need to handle these, so we cannot simply remove the 
{{latest}} executor run symlink. I guess we should just short-circuit the 
executor reconnect/reregister logic based on the {{rebooted}} field in the 
top-level {{State}} but keep the rest of the recovery logic.
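A toy model of the short circuit proposed above (purely illustrative: the real logic lives in the Mesos agent in C++, and names like {{ExecutorRunState}} and the return values here are invented, not Mesos internals):

```python
# Illustrative sketch only; all names are hypothetical.
class ExecutorRunState:
    def __init__(self, executor_id, forked_pid):
        self.executor_id = executor_id
        self.forked_pid = forked_pid


def recover_executor(state_rebooted, run):
    """Decide how to recover a checkpointed executor run.

    If the top-level State was checkpointed before a host reboot, the
    executor process is certainly gone, so skip the reconnect/reregister
    attempt (whose pid check could hit an unrelated, reused pid) while
    keeping the rest of recovery (e.g. resuming unacknowledged updates).
    """
    if state_rebooted:
        return "mark-terminated"   # don't try to reconnect to a reused pid
    return "attempt-reconnect"
```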

> Agent should properly handle recovering an executor when its pid is reused
> --
>
> Key: MESOS-8125
> URL: https://issues.apache.org/jira/browse/MESOS-8125
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>Assignee: Megha Sharma
>Priority: Critical
>
> We know that all executors will be gone once the host on which an agent is 
> running is rebooted, so there's no need to try to recover these executors.
> Trying to recover stopped executors can lead to problems if another process 
> is assigned the same pid that the executor had before the reboot. In this 
> case the agent will unsuccessfully try to reregister with the executor, and 
> then transition it to a {{TERMINATING}} state. The executor will sadly get 
> stuck in that state, and the tasks that it started will get stuck in whatever 
> state they were in at the time of the reboot.
> One way of getting rid of stuck executors is to remove the {{latest}} symlink 
> under {{work_dir/meta/slaves/latest/frameworks/<framework id>/executors/<executor id>/runs}}.
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen 
> with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta 
> directory, e.g., 
> {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}.
>  I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host
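The hazard these steps exercise is that after a reboot the checkpointed pid can belong to an unrelated process, so a plain pid-existence check yields a false positive. A minimal sketch of such a check (illustrative, not the agent's actual code):

```python
import os


def pid_exists(pid):
    """Return True if *some* process with this pid exists right now.

    After a reboot (or ordinary pid reuse) this can be a completely
    different process than the one that was checkpointed, which is how
    an agent can end up trying to reregister with, e.g., kthreadd (pid 2).
    """
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the pid exists, but is owned by another user
    return True
```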



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8413) Zookeeper configuration passwords are shown in clear text

2018-01-09 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319625#comment-16319625
 ] 

James Peach commented on MESOS-8413:


There's a similar issue with URLs for the {{CommandInfo.URI}} message. IIRC 
when I looked into that, the problem was that there was no code to crack the 
credentials out of the URL, so it wasn't even clear that the URL credentials 
didn't just happen to work by accident. These passwords end up in log files.
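One common mitigation for this class of leak (a sketch of the general technique, not what Mesos does) is to redact the userinfo password before a URL ever reaches a log line:

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url_password(url):
    """Replace the password portion of a URL's userinfo with a marker."""
    parts = urlsplit(url)
    if parts.password is None:
        return url  # no secret in the userinfo; leave untouched
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username or ''}:<redacted>@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))
```

For example, `redact_url_password("zk://user:secret@127.0.0.1:2181/mesos")` yields `zk://user:<redacted>@127.0.0.1:2181/mesos`, which is safe to log.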

> Zookeeper configuration passwords are shown in clear text
> -
>
> Key: MESOS-8413
> URL: https://issues.apache.org/jira/browse/MESOS-8413
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.1
>Reporter: Alexander Rojas
>Assignee: Alexander Rojas
>  Labels: mesosphere, security
>
> No matter how one configures mesos, either by passing the ZooKeeper flags in 
> the command line or using a file, as follows:
> {noformat}
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log 
> --zk=zk://${zk_username}:${zk_password}@${zk_addr}/mesos --quorum=1
> {noformat}
> {noformat}
> echo "zk://${zk_username}:${zk_password}@${zk_addr}/mesos" > 
> /tmp/${USER}/mesos/zk_config.txt
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log --zk=/tmp/${USER}/mesos/zk_config.txt
> {noformat}
> both the logs and the results of the {{/flags}} endpoint will resolve to the 
> contents of the flags, i.e.:
> {noformat}
> I0108 10:12:50.387522 28579 master.cpp:458] Flags at startup: 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="false" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticators="crammd5" 
> --authorizers="local" --filter_gpu_resources="true" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --log_dir="/tmp/user/mesos/master/log" --logbufsecs="0" 
> --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --quorum="1" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="20secs" 
> --registry_strict="false" --require_agent_domain="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/home/user/mesos/build/../src/webui" 
> --work_dir="/tmp/user/mesos/master" 
> --zk="zk://user@passwd:127.0.0.1:2181/mesos" --zk_session_timeout="10secs"
> {noformat}
> {noformat}
> HTTP/1.1 200 OK
> Content-Encoding: gzip
> Content-Length: 591
> Content-Type: application/json
> Date: Mon, 08 Jan 2018 15:12:53 GMT
> {
> "flags": {
> "agent_ping_timeout": "15secs",
> "agent_reregister_timeout": "10mins",
> "allocation_interval": "1secs",
> "allocator": "HierarchicalDRF",
> "authenticate_agents": "false",
> "authenticate_frameworks": "false",
> "authenticate_http_frameworks": "false",
> "authenticate_http_readonly": "false",
> "authenticate_http_readwrite": "false",
> "authenticators": "crammd5",
> "authorizers": "local",
> "filter_gpu_resources": "true",
> "framework_sorter": "drf",
> "help": "false",
> "hostname_lookup": "true",
> "http_authenticators": "basic",
> "initialize_driver_logging": "true",
> "log_auto_initialize": "true",
> "log_dir": "/tmp/user/mesos/master/log",
> "logbufsecs": "0",
> "logging_level": "INFO",
> "max_agent_ping_timeouts": "5",
> "max_completed_frameworks": "50",
> "max_completed_tasks_per_framework": "1000",
> "max_unreachable_tasks_per_framework": "1000",
> "port": "5050",
> "quiet": "false",
> "quorum": "1",
> "recovery_agent_removal_limit": "100%",
> "registry": "replicated_log",
> "registry_fetch_timeout": "1mins",
> "registry_gc_interval": "15mins",
> "registry_max_agent_age": "2weeks",
> "registry_max_agent_count": "102400",
> "registry_store_timeout": "20secs",
> "registry_strict": "false",
> "require_agent_domain": "false",
> "root_submissions": "true",
> "user_sorter": "drf",
> 

[jira] [Updated] (MESOS-8422) Master's UpdateSlave handler not correctly updating terminated operations

2018-01-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-8422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman updated MESOS-8422:
--
Summary: Master's UpdateSlave handler not correctly updating terminated 
operations  (was: Master's UpdateSlave handler not correctly updating 
operations)

> Master's UpdateSlave handler not correctly updating terminated operations
> -
>
> Key: MESOS-8422
> URL: https://issues.apache.org/jira/browse/MESOS-8422
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>  Labels: mesosphere
>
> I created a test that verifies that operation status updates are resent to 
> the master after being dropped en route to it (MESOS-8420).
> The test does the following:
> # Creates a volume from a RAW disk resource.
> # Drops the first `UpdateOperationStatusMessage` message from the agent to 
> the master, so that it isn't acknowledged by the master.
> # Restarts the agent.
> # Verifies that the agent resends the operation status update.
> The good news is that the agent resends the operation status update; the bad 
> news is that this triggers a CHECK failure that crashes the master.
> Here are the relevant sections of the log produced by the test:
> {noformat}
> [ RUN  ] 
> StorageLocalResourceProviderTest.ROOT_RetryOperationStatusUpdateAfterRecovery
> [...]
> I0109 16:36:08.515882 24106 master.cpp:4284] Processing ACCEPT call for 
> offers: [ 046b3f21-6e97-4a56-9a13-773f7d481efd-O0 ] on agent 
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 
> (core-dev) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd- (default) 
> at scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681
> I0109 16:36:08.516487 24106 master.cpp:5260] Processing CREATE_VOLUME 
> operation with source disk(allocated: storage)(reservations: 
> [(DYNAMIC,storage)])[RAW(,volume-default)]:4096 from framework 
> 046b3f21-6e97-4a56-9a13-773f7d481efd- (default) at 
> scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681 to agent 
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
> I0109 16:36:08.518704 24106 master.cpp:10622] Sending operation '' (uuid: 
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) to agent 
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
> I0109 16:36:08.521210 24130 provider.cpp:504] Received APPLY_OPERATION event
> I0109 16:36:08.521276 24130 provider.cpp:1368] Received CREATE_VOLUME 
> operation '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> I0109 16:36:08.523131 24432 test_csi_plugin.cpp:305] CreateVolumeRequest 
> '{"version":{"minor":1},"name":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","capacityRange":{"requiredBytes":"4294967296","limitBytes":"4294967296"},"volumeCapabilities":[{"mount":{},"accessMode":{"mode":"SINGLE_NODE_WRITER"}}]}'
> I0109 16:36:08.525806 24152 provider.cpp:2635] Applying conversion from 
> 'disk(allocated: storage)(reservations: 
> [(DYNAMIC,storage)])[RAW(,volume-default)]:4096' to 'disk(allocated: 
> storage)(reservations: 
> [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096'
>  for operation (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
> I0109 16:36:08.528725 24134 status_update_manager_process.hpp:152] Received 
> operation status update OPERATION_FINISHED (Status UUID: 
> 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework 
> '046b3f21-6e97-4a56-9a13-773f7d481efd-' on agent 
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.529207 24134 status_update_manager_process.hpp:929] 
> Checkpointing UPDATE for operation status update OPERATION_FINISHED (Status 
> UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework 
> '046b3f21-6e97-4a56-9a13-773f7d481efd-' on agent 
> 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
> I0109 16:36:08.573177 24150 http.cpp:1185] HTTP POST for 
> /slave(2)/api/v1/resource_provider from 10.0.49.2:53598
> I0109 16:36:08.573974 24139 slave.cpp:7065] Handling resource provider 
> message 'UPDATE_OPERATION_STATUS: (uuid: 
> 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 
> 046b3f21-6e97-4a56-9a13-773f7d481efd- (latest state: OPERATION_FINISHED, 
> status update state: OPERATION_FINISHED)'
> I0109 16:36:08.574154 24139 slave.cpp:7409] Updating the state of operation ' 
> with no ID (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 
> 046b3f21-6e97-4a56-9a13-773f7d481efd- (latest state: OPERATION_FINISHED, 
> status update state: OPERATION_FINISHED)
> I0109 16:36:08.574785 24139 slave.cpp:7249] Forwarding status update of 
> operation with no ID 

[jira] [Updated] (MESOS-8422) Master's UpdateSlave handler not correctly updating operations

2018-01-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-8422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman updated MESOS-8422:
--
Description: 
I created a test that verifies that operation status updates are resent to the 
master after being dropped en route to it (MESOS-8420).

The test does the following:

# Creates a volume from a RAW disk resource.
# Drops the first `UpdateOperationStatusMessage` message from the agent to the 
master, so that it isn't acknowledged by the master.
# Restarts the agent.
# Verifies that the agent resends the operation status update.

The good news is that the agent resends the operation status update; the bad 
news is that this triggers a CHECK failure that crashes the master.
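The behavior under test is standard at-least-once delivery: an update stays pending until the master acknowledges it, and a restarted agent resends everything still pending. A toy model of that contract (illustrative names, not the real status update manager API):

```python
class ToyStatusUpdateManager:
    """Minimal at-least-once delivery: checkpoint, send, resend until acked."""

    def __init__(self, checkpoint=None):
        # On agent restart, pending updates are recovered from the checkpoint.
        self.pending = list(checkpoint or [])

    def update(self, uuid):
        self.pending.append(uuid)   # checkpointed before being sent

    def acknowledge(self, uuid):
        self.pending.remove(uuid)   # only an ack removes an update

    def updates_to_send(self):
        return list(self.pending)   # sent on (re)registration with the master
```

If the first send is dropped, the update is still pending after the restart and gets resent, so the master must tolerate seeing the same terminal operation status twice; the CHECK failure below shows it currently does not.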

Here are the relevant sections of the log produced by the test:

{noformat}
[ RUN  ] 
StorageLocalResourceProviderTest.ROOT_RetryOperationStatusUpdateAfterRecovery
[...]
I0109 16:36:08.515882 24106 master.cpp:4284] Processing ACCEPT call for offers: 
[ 046b3f21-6e97-4a56-9a13-773f7d481efd-O0 ] on agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) 
for framework 046b3f21-6e97-4a56-9a13-773f7d481efd- (default) at 
scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681
I0109 16:36:08.516487 24106 master.cpp:5260] Processing CREATE_VOLUME operation 
with source disk(allocated: storage)(reservations: 
[(DYNAMIC,storage)])[RAW(,volume-default)]:4096 from framework 
046b3f21-6e97-4a56-9a13-773f7d481efd- (default) at 
scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681 to agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
I0109 16:36:08.518704 24106 master.cpp:10622] Sending operation '' (uuid: 
18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) to agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
I0109 16:36:08.521210 24130 provider.cpp:504] Received APPLY_OPERATION event
I0109 16:36:08.521276 24130 provider.cpp:1368] Received CREATE_VOLUME operation 
'' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
I0109 16:36:08.523131 24432 test_csi_plugin.cpp:305] CreateVolumeRequest 
'{"version":{"minor":1},"name":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","capacityRange":{"requiredBytes":"4294967296","limitBytes":"4294967296"},"volumeCapabilities":[{"mount":{},"accessMode":{"mode":"SINGLE_NODE_WRITER"}}]}'
I0109 16:36:08.525806 24152 provider.cpp:2635] Applying conversion from 
'disk(allocated: storage)(reservations: 
[(DYNAMIC,storage)])[RAW(,volume-default)]:4096' to 'disk(allocated: 
storage)(reservations: 
[(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096'
 for operation (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
I0109 16:36:08.528725 24134 status_update_manager_process.hpp:152] Received 
operation status update OPERATION_FINISHED (Status UUID: 
0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 
18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework 
'046b3f21-6e97-4a56-9a13-773f7d481efd-' on agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0
I0109 16:36:08.529207 24134 status_update_manager_process.hpp:929] 
Checkpointing UPDATE for operation status update OPERATION_FINISHED (Status 
UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 
18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework 
'046b3f21-6e97-4a56-9a13-773f7d481efd-' on agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0
I0109 16:36:08.573177 24150 http.cpp:1185] HTTP POST for 
/slave(2)/api/v1/resource_provider from 10.0.49.2:53598
I0109 16:36:08.573974 24139 slave.cpp:7065] Handling resource provider message 
'UPDATE_OPERATION_STATUS: (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for 
framework 046b3f21-6e97-4a56-9a13-773f7d481efd- (latest state: 
OPERATION_FINISHED, status update state: OPERATION_FINISHED)'
I0109 16:36:08.574154 24139 slave.cpp:7409] Updating the state of operation ' 
with no ID (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 
046b3f21-6e97-4a56-9a13-773f7d481efd- (latest state: OPERATION_FINISHED, 
status update state: OPERATION_FINISHED)
I0109 16:36:08.574785 24139 slave.cpp:7249] Forwarding status update of 
operation with no ID (operation_uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for 
framework 046b3f21-6e97-4a56-9a13-773f7d481efd-
I0109 16:36:08.583748 24084 slave.cpp:931] Agent terminating
I0109 16:36:08.584115 24144 master.cpp:1305] Agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) 
disconnected
[...]
I0109 16:36:08.655766 24140 slave.cpp:1378] Re-registered with master 
master@10.0.49.2:40681
I0109 16:36:08.655936 24117 task_status_update_manager.cpp:188] Resuming 
sending task status updates
I0109 16:36:08.655995 24149 hierarchical.cpp:669] Agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total resources 
cpus:2; 

[jira] [Updated] (MESOS-8422) Master's UpdateSlave handler not correctly updating terminated operations

2018-01-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-8422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman updated MESOS-8422:
--
Description: 
I created a test that verifies that operation status updates are resent to the 
master after being dropped en route to it (MESOS-8420).

The test does the following:

# Creates a volume from a RAW disk resource.
# Drops the first `UpdateOperationStatusMessage` message from the agent to the 
master, so that it isn't acknowledged by the master.
# Restarts the agent.
# Verifies that the agent resends the operation status update.

The good news is that the agent resends the operation status update; the bad 
news is that this triggers a CHECK failure that crashes the master.

Here are the relevant sections of the log produced by the test:

{noformat}
[ RUN  ] 
StorageLocalResourceProviderTest.ROOT_RetryOperationStatusUpdateAfterRecovery
[...]
I0109 16:36:08.515882 24106 master.cpp:4284] Processing ACCEPT call for offers: 
[ 046b3f21-6e97-4a56-9a13-773f7d481efd-O0 ] on agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) 
for framework 046b3f21-6e97-4a56-9a13-773f7d481efd- (default) at 
scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681
I0109 16:36:08.516487 24106 master.cpp:5260] Processing CREATE_VOLUME operation 
with source disk(allocated: storage)(reservations: 
[(DYNAMIC,storage)])[RAW(,volume-default)]:4096 from framework 
046b3f21-6e97-4a56-9a13-773f7d481efd- (default) at 
scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681 to agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
I0109 16:36:08.518704 24106 master.cpp:10622] Sending operation '' (uuid: 
18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) to agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
I0109 16:36:08.521210 24130 provider.cpp:504] Received APPLY_OPERATION event
I0109 16:36:08.521276 24130 provider.cpp:1368] Received CREATE_VOLUME operation 
'' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
I0109 16:36:08.523131 24432 test_csi_plugin.cpp:305] CreateVolumeRequest 
'{"version":{"minor":1},"name":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","capacityRange":{"requiredBytes":"4294967296","limitBytes":"4294967296"},"volumeCapabilities":[{"mount":{},"accessMode":{"mode":"SINGLE_NODE_WRITER"}}]}'
I0109 16:36:08.525806 24152 provider.cpp:2635] Applying conversion from 
'disk(allocated: storage)(reservations: 
[(DYNAMIC,storage)])[RAW(,volume-default)]:4096' to 'disk(allocated: 
storage)(reservations: 
[(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096'
 for operation (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
I0109 16:36:08.528725 24134 status_update_manager_process.hpp:152] Received 
operation status update OPERATION_FINISHED (Status UUID: 
0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 
18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework 
'046b3f21-6e97-4a56-9a13-773f7d481efd-' on agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0
I0109 16:36:08.529207 24134 status_update_manager_process.hpp:929] 
Checkpointing UPDATE for operation status update OPERATION_FINISHED (Status 
UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 
18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework 
'046b3f21-6e97-4a56-9a13-773f7d481efd-' on agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0
I0109 16:36:08.573177 24150 http.cpp:1185] HTTP POST for 
/slave(2)/api/v1/resource_provider from 10.0.49.2:53598
I0109 16:36:08.573974 24139 slave.cpp:7065] Handling resource provider message 
'UPDATE_OPERATION_STATUS: (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for 
framework 046b3f21-6e97-4a56-9a13-773f7d481efd- (latest state: 
OPERATION_FINISHED, status update state: OPERATION_FINISHED)'
I0109 16:36:08.574154 24139 slave.cpp:7409] Updating the state of operation ' 
with no ID (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 
046b3f21-6e97-4a56-9a13-773f7d481efd- (latest state: OPERATION_FINISHED, 
status update state: OPERATION_FINISHED)
I0109 16:36:08.574785 24139 slave.cpp:7249] Forwarding status update of 
operation with no ID (operation_uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for 
framework 046b3f21-6e97-4a56-9a13-773f7d481efd-
I0109 16:36:08.583748 24084 slave.cpp:931] Agent terminating
I0109 16:36:08.584115 24144 master.cpp:1305] Agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) 
disconnected
[...]
I0109 16:36:08.655766 24140 slave.cpp:1378] Re-registered with master 
master@10.0.49.2:40681
I0109 16:36:08.655936 24117 task_status_update_manager.cpp:188] Resuming 
sending task status updates
I0109 16:36:08.655995 24149 hierarchical.cpp:669] Agent 
046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total resources 
cpus:2; 

[jira] [Created] (MESOS-8422) Master's UpdateSlave handler not correctly updating operations

2018-01-09 Thread JIRA
Gastón Kleiman created MESOS-8422:
-

 Summary: Master's UpdateSlave handler not correctly updating 
operations
 Key: MESOS-8422
 URL: https://issues.apache.org/jira/browse/MESOS-8422
 Project: Mesos
  Issue Type: Bug
Reporter: Gastón Kleiman








[jira] [Created] (MESOS-8421) Duration operators drop precision, even when used with integers

2018-01-09 Thread Andrew Schwartzmeyer (JIRA)
Andrew Schwartzmeyer created MESOS-8421:
---

 Summary: Duration operators drop precision, even when used with 
integers
 Key: MESOS-8421
 URL: https://issues.apache.org/jira/browse/MESOS-8421
 Project: Mesos
  Issue Type: Improvement
  Components: stout
Reporter: Andrew Schwartzmeyer
Priority: Minor


The implementation of {{Duration operator*=()}} is as follows:

{noformat}
  Duration& operator*=(double multiplier)
  {
    nanos = static_cast<int64_t>(nanos * multiplier);
    return *this;
  }
{noformat}

A similar pattern is implemented for all the operators. This means that, even 
when multiplying by {{int64_t}} (underlying type of {{nanos}}), we lose 
precision.

While [Review #64729|https://reviews.apache.org/r/64729/] removes the 
conversion warnings from {{int}} and {{size_t}} to {{double}}, it purposefully 
does not address fixing the precision of these operators (as that'll be a 
change in behavior, albeit slight, and should be done for the whole class at 
once).
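The precision loss is easy to demonstrate with IEEE-754 doubles (which Python floats are), since a double's 53-bit mantissa cannot represent every int64 value:

```python
# A double has a 53-bit mantissa, so not all integers above 2**53 are
# representable. Routing an int64 multiplication through a double (as the
# Duration operators do) can therefore change the result.
nanos = 2**60 + 1              # representable as int64, not as a double
exact = nanos * 2              # pure integer arithmetic: 2**61 + 2
via_double = int(nanos * 2.0)  # mimics static_cast<int64_t>(nanos * 2.0)
assert via_double != exact     # the low bit was lost in the double round-trip
```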





[jira] [Commented] (MESOS-8078) Some fields went missing with no replacement in api/v1

2018-01-09 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319427#comment-16319427
 ] 

Greg Mann commented on MESOS-8078:
--

There are some similarly missing fields in the agent operator API. I'll follow 
up with a patch for those shortly.

> Some fields went missing with no replacement in api/v1
> --
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Story
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>Assignee: Greg Mann
>Priority: Critical
>  Labels: mesosphere
>
> Hi friends, 
> These fields are available via the state.json but went missing in the v1 of 
> the API:
> -leader_info- -> available via GET_MASTER which should always return leading 
> master info
> start_time
> elected_time
> As we're showing them on the Overview page of the DC/OS UI but would like to 
> stop using state.json, it would be great to have them somewhere in V1.





[jira] [Commented] (MESOS-8078) Some fields went missing with no replacement in api/v1

2018-01-09 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319425#comment-16319425
 ] 

Greg Mann commented on MESOS-8078:
--

Review here: https://reviews.apache.org/r/65056/

> Some fields went missing with no replacement in api/v1
> --
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Story
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>Assignee: Greg Mann
>Priority: Critical
>  Labels: mesosphere
>
> Hi friends, 
> These fields are available via the state.json but went missing in the v1 of 
> the API:
> -leader_info- -> available via GET_MASTER which should always return leading 
> master info
> start_time
> elected_time
> As we're showing them on the Overview page of the DC/OS UI but would like to 
> stop using state.json, it would be great to have them somewhere in V1.





[jira] [Updated] (MESOS-8413) Zookeeper configuration passwords are shown in clear text

2018-01-09 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8413:
-
Shepherd: Greg Mann

> Zookeeper configuration passwords are shown in clear text
> -
>
> Key: MESOS-8413
> URL: https://issues.apache.org/jira/browse/MESOS-8413
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.1
>Reporter: Alexander Rojas
>Assignee: Alexander Rojas
>  Labels: mesosphere, security
>
> No matter how one configures mesos, either by passing the ZooKeeper flags in 
> the command line or using a file, as follows:
> {noformat}
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log 
> --zk=zk://${zk_username}:${zk_password}@${zk_addr}/mesos --quorum=1
> {noformat}
> {noformat}
> echo "zk://${zk_username}:${zk_password}@${zk_addr}/mesos" > 
> /tmp/${USER}/mesos/zk_config.txt
> ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master 
> --log_dir=/tmp/$USER/mesos/master/log --zk=/tmp/${USER}/mesos/zk_config.txt
> {noformat}
> both the logs and the results of the {{/flags}} endpoint will resolve to the 
> contents of the flags, i.e.:
> {noformat}
> I0108 10:12:50.387522 28579 master.cpp:458] Flags at startup: 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="false" --authenticate_frameworks="false" 
> --authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticators="crammd5" 
> --authorizers="local" --filter_gpu_resources="true" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --log_dir="/tmp/user/mesos/master/log" --logbufsecs="0" 
> --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --quorum="1" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="20secs" 
> --registry_strict="false" --require_agent_domain="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/home/user/mesos/build/../src/webui" 
> --work_dir="/tmp/user/mesos/master" 
> --zk="zk://user@passwd:127.0.0.1:2181/mesos" --zk_session_timeout="10secs"
> {noformat}
> {noformat}
> HTTP/1.1 200 OK
> Content-Encoding: gzip
> Content-Length: 591
> Content-Type: application/json
> Date: Mon, 08 Jan 2018 15:12:53 GMT
> {
> "flags": {
> "agent_ping_timeout": "15secs",
> "agent_reregister_timeout": "10mins",
> "allocation_interval": "1secs",
> "allocator": "HierarchicalDRF",
> "authenticate_agents": "false",
> "authenticate_frameworks": "false",
> "authenticate_http_frameworks": "false",
> "authenticate_http_readonly": "false",
> "authenticate_http_readwrite": "false",
> "authenticators": "crammd5",
> "authorizers": "local",
> "filter_gpu_resources": "true",
> "framework_sorter": "drf",
> "help": "false",
> "hostname_lookup": "true",
> "http_authenticators": "basic",
> "initialize_driver_logging": "true",
> "log_auto_initialize": "true",
> "log_dir": "/tmp/user/mesos/master/log",
> "logbufsecs": "0",
> "logging_level": "INFO",
> "max_agent_ping_timeouts": "5",
> "max_completed_frameworks": "50",
> "max_completed_tasks_per_framework": "1000",
> "max_unreachable_tasks_per_framework": "1000",
> "port": "5050",
> "quiet": "false",
> "quorum": "1",
> "recovery_agent_removal_limit": "100%",
> "registry": "replicated_log",
> "registry_fetch_timeout": "1mins",
> "registry_gc_interval": "15mins",
> "registry_max_agent_age": "2weeks",
> "registry_max_agent_count": "102400",
> "registry_store_timeout": "20secs",
> "registry_strict": "false",
> "require_agent_domain": "false",
> "root_submissions": "true",
> "user_sorter": "drf",
> "version": "false",
> "webui_dir": "/home/user/mesos/build/../src/webui",
> "work_dir": "/tmp/user/mesos/master",
> "zk": "zk://user@passwd:127.0.0.1:2181/mesos",
> "zk_session_timeout": "10secs"
> }
> }
> {noformat}
> Which leads to having no effective way to prevent the passwords from being shown 
> 

[jira] [Updated] (MESOS-8420) Test that operation status updates are retried after being dropped en-route to the master.

2018-01-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-8420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman updated MESOS-8420:
--
Summary: Test that operation status updates are retried after being dropped 
en-route to the master.  (was: Verify end-to-end operation status update)

> Test that operation status updates are retried after being dropped en-route 
> to the master.
> --
>
> Key: MESOS-8420
> URL: https://issues.apache.org/jira/browse/MESOS-8420
> Project: Mesos
>  Issue Type: Task
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>  Labels: mesosphere
>






[jira] [Created] (MESOS-8420) Verify end-to-end operation status update

2018-01-09 Thread JIRA
Gastón Kleiman created MESOS-8420:
-

 Summary: Verify end-to-end operation status update
 Key: MESOS-8420
 URL: https://issues.apache.org/jira/browse/MESOS-8420
 Project: Mesos
  Issue Type: Task
Reporter: Gastón Kleiman
Assignee: Gastón Kleiman








[jira] [Updated] (MESOS-7803) fs::list drops path components on Windows

2018-01-09 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer updated MESOS-7803:

Priority: Major  (was: Minor)

> fs::list drops path components on Windows
> -
>
> Key: MESOS-7803
> URL: https://issues.apache.org/jira/browse/MESOS-7803
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
> Environment: Windows 10
>Reporter: Andrew Schwartzmeyer
>Assignee: Andrew Schwartzmeyer
>  Labels: windows
>
> fs::list(/foo/bar/*.txt) returns a.txt, b.txt, not /foo/bar/a.txt, 
> /foo/bar/b.txt
> This breaks a ZooKeeper test on Windows.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8224) mesos.interface 1.4.0 cannot be installed with pip

2018-01-09 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya updated MESOS-8224:
--
Story Points: 1
  Sprint: Mesosphere Sprint 72

Packages available at:
https://pypi.python.org/pypi/mesos.interface/1.4.0
The following succeeds now:
{code}
python -m pip install --user mesos.interface==1.4.0
{code}

> mesos.interface 1.4.0 cannot be installed with pip
> --
>
> Key: MESOS-8224
> URL: https://issues.apache.org/jira/browse/MESOS-8224
> Project: Mesos
>  Issue Type: Task
>  Components: release
>Reporter: Bill Farner
>
> This breaks some framework development tooling.
> With the latest pip:
> {noformat}
> $ python -m pip -V
> pip 9.0.1 from 
> /Users/wfarner/code/aurora/build-support/python/pycharm.venv/lib/python2.7/site-packages
>  (python 2.7)
> {noformat}
> This works fine for previous releases:
> {noformat}
> $ python -m pip install mesos.interface==1.3.0
> Collecting mesos.interface==1.3.0
> ...
> Installing collected packages: mesos.interface
> Successfully installed mesos.interface-1.3.0
> {noformat}
> But it does not for 1.4.0:
> {noformat}
> $ python -m pip install mesos.interface==1.4.0
> Collecting mesos.interface==1.4.0
>   Could not find a version that satisfies the requirement 
> mesos.interface==1.4.0 (from versions: 0.21.2.linux-x86_64, 
> 0.22.1.2.linux-x86_64, 0.22.2.linux-x86_64, 0.23.1.linux-x86_64, 
> 0.24.1.linux-x86_64, 0.24.2.linux-x86_64, 0.25.0.linux-x86_64, 
> 0.25.1.linux-x86_64, 0.26.1.linux-x86_64, 0.27.0.linux-x86_64, 
> 0.27.1.linux-x86_64, 0.27.2.linux-x86_64, 0.28.0.linux-x86_64, 
> 0.28.1.linux-x86_64, 0.28.2.linux-x86_64, 1.0.0.linux-x86_64, 
> 1.0.1.linux-x86_64, 1.1.0.linux-x86_64, 1.2.0.linux-x86_64, 
> 1.3.0.linux-x86_64, 0.20.0, 0.20.1, 0.21.0, 0.21.1, 0.21.2, 0.22.0, 0.22.1.2, 
> 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.26.0, 
> 0.26.1, 0.27.0, 0.27.1, 0.27.2, 0.28.0, 0.28.1, 0.28.2, 1.0.0, 1.0.1, 1.1.0, 
> 1.2.0, 1.3.0)
> No matching distribution found for mesos.interface==1.4.0
> {noformat}
> Verbose output shows that pip skips the 1.4.0 distribution:
> {noformat}
> $ python -m pip install -v mesos.interface==1.4.0 | grep 1.4.0
> Collecting mesos.interface==1.4.0
> Skipping link 
> https://pypi.python.org/packages/ef/1b/d5b0c1456f755ad42477eaa9667e22d1f5fd8e2fce0f9b26937f93743f6c/mesos.interface-1.4.0-py2.7.egg#md5=32113860961d49c31f69f7b13a9bc063
>  (from https://pypi.python.org/simple/mesos-interface/); unsupported archive 
> format: .egg
> {noformat}
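The verbose output explains the failure: pip only treats source archives and wheels as installable, and a bare {{.egg}} link is not a supported archive format. A toy sketch of that filtering decision (the extension list is an illustrative subset of pip's actual logic, not its real code):

```python
# Illustrative subset of archive types pip 9 will install from; .egg
# files are an easy_install format and are skipped by pip.
SUPPORTED = (".whl", ".tar.gz", ".zip", ".tar.bz2")

def installable(filename):
    # str.endswith accepts a tuple of suffixes.
    return filename.endswith(SUPPORTED)

print(installable("mesos.interface-1.3.0.tar.gz"))     # True
print(installable("mesos.interface-1.4.0-py2.7.egg"))  # False
```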



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8078) Some fields went missing with no replacement in api/v1

2018-01-09 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319344#comment-16319344
 ] 

Greg Mann commented on MESOS-8078:
--

[~drozhkov], thanks for the ping. I'm working on this issue today, hoping to 
post a patch by EOD.

> Some fields went missing with no replacement in api/v1
> --
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Story
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>Assignee: Greg Mann
>Priority: Critical
>  Labels: mesosphere
>
> Hi friends, 
> These fields are available via state.json but went missing in v1 of the API:
> -leader_info- -> available via GET_MASTER, which should always return the 
> leading master's info
> start_time
> elected_time
> As we're showing them on the Overview page of the DC/OS UI but would like to 
> stop using state.json, it would be great to have them somewhere in V1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8078) Some fields went missing with no replacement in api/v1

2018-01-09 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8078:
-
Shepherd: Vinod Kone  (was: Greg Mann)

> Some fields went missing with no replacement in api/v1
> --
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Story
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>Assignee: Greg Mann
>Priority: Critical
>  Labels: mesosphere
>
> Hi friends, 
> These fields are available via state.json but went missing in v1 of the API:
> -leader_info- -> available via GET_MASTER, which should always return the 
> leading master's info
> start_time
> elected_time
> As we're showing them on the Overview page of the DC/OS UI but would like to 
> stop using state.json, it would be great to have them somewhere in V1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8419) RP manager incorrectly setting framework ID leads to CHECK failure

2018-01-09 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-8419:


Assignee: Greg Mann

> RP manager incorrectly setting framework ID leads to CHECK failure
> --
>
> Key: MESOS-8419
> URL: https://issues.apache.org/jira/browse/MESOS-8419
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: mesosphere
>
> The resource provider manager [unconditionally sets the framework 
> ID|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/manager.cpp#L637]
>  when forwarding operation status updates to the agent. This is incorrect, 
> for example, when the resource provider [generates OPERATION_DROPPED updates 
> during 
> reconciliation|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/storage/provider.cpp#L1653-L1657],
>  and leads to protobuf errors in this case since the framework ID's required 
> {{value}} field is left unset.
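The straightforward fix is to copy the framework ID onto the forwarded update only when it is actually set. An illustrative sketch in Python, with plain classes standing in for the protobuf messages (this is not the actual manager code):

```python
class UpdateStatus(object):
    """Stand-in for the operation status update message."""
    def __init__(self, framework_id=None):
        self.framework_id = framework_id  # None models an unset optional field

def forward(update):
    """Forward an update, setting framework_id only when present."""
    forwarded = UpdateStatus()
    if update.framework_id is not None:
        # Guard: an OPERATION_DROPPED update generated during reconciliation
        # has no framework ID, so copying unconditionally would produce an
        # ID message whose required value field is empty.
        forwarded.framework_id = update.framework_id
    return forwarded

print(forward(UpdateStatus("fw-1")).framework_id)  # fw-1
print(forward(UpdateStatus()).framework_id)        # None
```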



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Fix Version/s: 1.3.2

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
> Fix For: 1.3.2, 1.4.2, 1.5.0
>
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already in use by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container about to be launched (in my case I have 
> only one container, not a group of tasks). Thus the ownership of the 
> PersistentVolume stays "root" (I run mesos-agent as root) and it's impossible 
> to use the volume inside the container. We always run processes inside Docker 
> containers as an unprivileged user.
> A small patch that excludes the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Target Version/s: 1.3.2, 1.4.2, 1.5.1  (was: 1.4.2, 1.5.1)

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
> Fix For: 1.4.2, 1.5.0
>
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already in use by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container about to be launched (in my case I have 
> only one container, not a group of tasks). Thus the ownership of the 
> PersistentVolume stays "root" (I run mesos-agent as root) and it's impossible 
> to use the volume inside the container. We always run processes inside Docker 
> containers as an unprivileged user.
> A small patch that excludes the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Fix Version/s: 1.4.2

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
> Fix For: 1.4.2, 1.5.0
>
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already in use by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container about to be launched (in my case I have 
> only one container, not a group of tasks). Thus the ownership of the 
> PersistentVolume stays "root" (I run mesos-agent as root) and it's impossible 
> to use the volume inside the container. We always run processes inside Docker 
> containers as an unprivileged user.
> A small patch that excludes the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319113#comment-16319113
 ] 

Jie Yu commented on MESOS-8356:
---

commit c8e6487d251d938c3c221f606f7e924514877655 (origin/master, origin/HEAD, 
master)
Author: Jie Yu 
Date:   Tue Jan 9 11:23:20 2018 -0800

Fixed the persistent volume permission issue in DockerContainerizer.

This patch fixes MESOS-8356 by skipping the current container to be
launched when doing the shared volume check (`isVolumeInUse`). Prior to
this patch, the code is buggy because `isVolumeInUse` will always be set
to `true`.

Review: https://reviews.apache.org/r/65049

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
> Fix For: 1.5.0
>
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already in use by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container about to be launched (in my case I have 
> only one container, not a group of tasks). Thus the ownership of the 
> PersistentVolume stays "root" (I run mesos-agent as root) and it's impossible 
> to use the volume inside the container. We always run processes inside Docker 
> containers as an unprivileged user.
> A small patch that excludes the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Fix Version/s: 1.5.0

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
> Fix For: 1.5.0
>
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already in use by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container about to be launched (in my case I have 
> only one container, not a group of tasks). Thus the ownership of the 
> PersistentVolume stays "root" (I run mesos-agent as root) and it's impossible 
> to use the volume inside the container. We always run processes inside Docker 
> containers as an unprivileged user.
> A small patch that excludes the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8078) Some fields went missing with no replacement in api/v1

2018-01-09 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8078:
-
Sprint: Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 68, 
Mesosphere Sprint 72  (was: Mesosphere Sprint 66, Mesosphere Sprint 67, 
Mesosphere Sprint 68)

> Some fields went missing with no replacement in api/v1
> --
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Story
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>Assignee: Greg Mann
>Priority: Critical
>  Labels: mesosphere
>
> Hi friends, 
> These fields are available via the state.json but went missing in the v1 of 
> the API:
> -leader_info- -> available via GET_MASTER which should always return leading 
> master info
> start_time
> elected_time
> As we're showing them on the Overview page of the DC/OS UI, yet would like 
> not be using state.json, it would be great to have them somewhere in V1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8078) Some fields went missing with no replacement in api/v1

2018-01-09 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-8078:


Assignee: Greg Mann  (was: Vinod Kone)

> Some fields went missing with no replacement in api/v1
> --
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Story
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>Assignee: Greg Mann
>Priority: Critical
>  Labels: mesosphere
>
> Hi friends, 
> These fields are available via the state.json but went missing in the v1 of 
> the API:
> -leader_info- -> available via GET_MASTER which should always return leading 
> master info
> start_time
> elected_time
> As we're showing them on the Overview page of the DC/OS UI, yet would like 
> not be using state.json, it would be great to have them somewhere in V1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318943#comment-16318943
 ] 

Jie Yu commented on MESOS-8356:
---

I verified that it's not an issue with the Mesos containerizer (a.k.a. the 
universal containerizer), but it is a problem for the Docker containerizer.

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already in use by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container about to be launched (in my case I have 
> only one container, not a group of tasks). Thus the ownership of the 
> PersistentVolume stays "root" (I run mesos-agent as root) and it's impossible 
> to use the volume inside the container. We always run processes inside Docker 
> containers as an unprivileged user.
> A small patch that excludes the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Affects Version/s: 1.1.3, 1.2.3, 1.3.1
 Target Version/s: 1.4.2, 1.5.1

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already in use by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container about to be launched (in my case I have 
> only one container, not a group of tasks). Thus the ownership of the 
> PersistentVolume stays "root" (I run mesos-agent as root) and it's impossible 
> to use the volume inside the container. We always run processes inside Docker 
> containers as an unprivileged user.
> A small patch that excludes the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Priority: Critical  (was: Major)

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already in use by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container about to be launched (in my case I have 
> only one container, not a group of tasks). Thus the ownership of the 
> PersistentVolume stays "root" (I run mesos-agent as root) and it's impossible 
> to use the volume inside the container. We always run processes inside Docker 
> containers as an unprivileged user.
> A small patch that excludes the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318907#comment-16318907
 ] 

Jie Yu edited comment on MESOS-8356 at 1/9/18 6:45 PM:
---

[~kkalin] Thanks for reporting!

[~xujyan] This looks like a bug to me because `current` is always set to empty 
in Docker containerizer:
https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/docker.cpp#L625

The logic in the Mesos containerizer (i.e., the filesystem/linux isolator) is 
slightly different, as `current` there is set to `info->resources`, and is thus 
not buggy:
https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L644




was (Author: jieyu):
[~kkalin] Thanks for reporting!

[~xujyan] This looks like a bug to me because `current` is always set to empty 
in Docker containerizer:
https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/docker.cpp#L625

The logic in the Mesos containerizer (i.e., filesystem/linux isolator) is 
slightly different as `current` there is set to be `info->resources`:
https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L644



> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>Priority: Critical
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already in use by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container about to be launched (in my case I have 
> only one container, not a group of tasks). Thus the ownership of the 
> PersistentVolume stays "root" (I run mesos-agent as root) and it's impossible 
> to use the volume inside the container. We always run processes inside Docker 
> containers as an unprivileged user.
> A small patch that excludes the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318907#comment-16318907
 ] 

Jie Yu commented on MESOS-8356:
---

[~kkalin] Thanks for reporting!

[~xujyan] This looks like a bug to me because `current` is always set to empty 
in Docker containerizer:
https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/docker.cpp#L625

The logic in the Mesos containerizer (i.e., the filesystem/linux isolator) is 
slightly different, as `current` there is set to `info->resources`:
https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L644



> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already in use by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container about to be launched (in my case I have 
> only one container, not a group of tasks). Thus the ownership of the 
> PersistentVolume stays "root" (I run mesos-agent as root) and it's impossible 
> to use the volume inside the container. We always run processes inside Docker 
> containers as an unprivileged user.
> A small patch that excludes the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8125) Agent should properly handle recovering an executor when its pid is reused

2018-01-09 Thread Megha Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Megha Sharma reassigned MESOS-8125:
---

Assignee: Megha Sharma

> Agent should properly handle recovering an executor when its pid is reused
> --
>
> Key: MESOS-8125
> URL: https://issues.apache.org/jira/browse/MESOS-8125
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>Assignee: Megha Sharma
>Priority: Critical
>
> We know that all executors will be gone once the host on which an agent is 
> running is rebooted, so there's no need to try to recover these executors.
> Trying to recover stopped executors can lead to problems if another process 
> is assigned the same pid that the executor had before the reboot. In this 
> case the agent will unsuccessfully try to reregister with the executor, and 
> then transition it to a {{TERMINATING}} state. The executor will sadly get 
> stuck in that state, and the tasks that it started will get stuck in whatever 
> state they were in at the time of the reboot.
> One way of getting rid of stuck executors is to remove the {{latest}} symlink 
> under {{work_dir/meta/slaves/latest/frameworks/<framework id>/executors/<executor id>/runs}}.
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen 
> with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta 
> directory, e.g., 
> {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}.
>  I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host





[jira] [Commented] (MESOS-8348) Enable function sections in the build.

2018-01-09 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318864#comment-16318864
 ] 

James Peach commented on MESOS-8348:


No apparent performance difference with a quick and arbitrary benchmark.

*Without GC unused sections:*

{noformat}
[--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test
[ RUN  ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
Starting reregistration for all agents
Reregistered 2000 agents with a total of 10 running tasks and 10 
completed tasks in 28.812622779secs
[   OK ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
 (60329 ms)
[ RUN  ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
Starting reregistration for all agents
Reregistered 2000 agents with a total of 20 running tasks and 0 completed 
tasks in 39.378296252secs
[   OK ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
 (98509 ms)
[ RUN  ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
Starting reregistration for all agents
Reregistered 2 agents with a total of 10 running tasks and 0 completed 
tasks in 45.240454686secs
[   OK ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
 (80371 ms)
[--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test 
(239209 ms total)
{noformat}

*With GC unused sections:*

{noformat}
[--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test
[ RUN  ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
Starting reregistration for all agents
Reregistered 2000 agents with a total of 10 running tasks and 10 
completed tasks in 28.751620417secs
[   OK ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
 (59282 ms)
[ RUN  ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
Starting reregistration for all agents
Reregistered 2000 agents with a total of 20 running tasks and 0 completed 
tasks in 40.010202034secs
[   OK ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
 (96938 ms)
[ RUN  ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
Starting reregistration for all agents
Reregistered 2 agents with a total of 10 running tasks and 0 completed 
tasks in 44.541095336secs
[   OK ] 
AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
 (79331 ms)
[--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test 
(235551 ms total)
{noformat}


> Enable function sections in the build.
> --
>
> Key: MESOS-8348
> URL: https://issues.apache.org/jira/browse/MESOS-8348
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: James Peach
>Assignee: James Peach
>
> Enable {{-ffunction-sections}} to improve the ability of the toolchain to 
> remove unused code.





[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Target Version/s:   (was: 1.4.2, 1.5.1)

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already being used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only 
> one container, no group of tasks). Thus the ownership of the PersistentVolume 
> stays "root" (I run mesos-agent under root) and it's impossible to use the 
> volume inside the container. We always run processes inside Docker containers 
> under an unprivileged user.
> A small patch that excludes the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}





[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Target Version/s: 1.4.2, 1.5.1

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already being used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only 
> one container, no group of tasks). Thus the ownership of the PersistentVolume 
> stays "root" (I run mesos-agent under root) and it's impossible to use the 
> volume inside the container. We always run processes inside Docker containers 
> under an unprivileged user.
> A small patch that excludes the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}





[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used

2018-01-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8356:
--
Affects Version/s: 1.4.1

> Persistent volume ownership is set to root despite of sandbox owner 
> (frameworkInfo.user) when docker executor is used
> -
>
> Key: MESOS-8356
> URL: https://issues.apache.org/jira/browse/MESOS-8356
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1
> Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13
>Reporter: Konstantin Kalin
>Assignee: Jie Yu
>  Labels: persistent-volumes
>
> PersistentVolume ownership is not set to match the sandbox user when the 
> docker executor is used. Looks like the issue was introduced by 
> https://reviews.apache.org/r/45963/
> I didn't check the universal containerizer yet. 
> As far as I understand, the following code is supposed to check that a volume 
> is not already being used by other tasks/containers.
> src/slave/containerizer/docker.cpp
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource)) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}
> But it doesn't exclude the container being launched (in my case I have only 
> one container, no group of tasks). Thus the ownership of the PersistentVolume 
> stays "root" (I run mesos-agent under root) and it's impossible to use the 
> volume inside the container. We always run processes inside Docker containers 
> under an unprivileged user.
> A small patch that excludes the container being launched fixes the issue.
> {code:java}
> foreachvalue (const Container* container, containers_) {
>   if (container->resources.contains(resource) &&
>   containerId != container->id) {
> isVolumeInUse = true;
> break;
>   }
> }
> {code}





[jira] [Commented] (MESOS-6595) As a Mesos user I want to launch processes that will run on every node in the cluster

2018-01-09 Thread Sampsa Tuokko (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318802#comment-16318802
 ] 

Sampsa Tuokko commented on MESOS-6595:
--

Any progress at all on this? DaemonSets are a heavily used feature in the 
Kubernetes world; this would be immensely useful from an operations perspective.

> As a Mesos user I want to launch processes that will run on every node in the 
> cluster
> -
>
> Key: MESOS-6595
> URL: https://issues.apache.org/jira/browse/MESOS-6595
> Project: Mesos
>  Issue Type: Story
>Reporter: James DeFelice
>  Labels: mesosphere
>
> Some applicable use cases:
> - log collection
> - metrics and monitoring
> - service discovery
> It might also be useful to break this functionality down into: daemon 
> processes for master nodes vs. daemon processes for agent nodes.
> There was some initial discussion and back-of-the-napkin design for this at 
> Mesoscon this past year (with an emphasis on agent nodes) but I'm not aware 
> that anything significant materialized from that.





[jira] [Updated] (MESOS-7854) Authorize resource calls to provider manager api

2018-01-09 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7854:

Description: 
The resource provider manager provides a function
{code}
process::Future<process::http::Response> api(
    const process::http::Request& request,
    const Option<Principal>& principal) const;
{code}
which is exposed e.g., as an agent endpoint.

We need to add authorization to this function in order to, e.g., stop rogue 
callers.

  was:
The resource provider manager provides a function
{code}
process::Future<process::http::Response> api(
    const process::http::Request& request,
    const Option<Principal>& principal) const;
{code}
which is expose e.g., as an agent endpoint.

We need to add authorization to this function in order to e.g., stop rough 
callers.


> Authorize resource calls to provider manager api
> 
>
> Key: MESOS-7854
> URL: https://issues.apache.org/jira/browse/MESOS-7854
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Priority: Critical
>  Labels: csi-post-mvp, mesosphere, storage
>
> The resource provider manager provides a function
> {code}
> process::Future<process::http::Response> api(
>     const process::http::Request& request,
>     const Option<Principal>& principal) const;
> {code}
> which is exposed e.g., as an agent endpoint.
> We need to add authorization to this function in order to, e.g., stop rogue 
> callers.





[jira] [Updated] (MESOS-7558) Add resource provider validation

2018-01-09 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7558:

Story Points: 2

> Add resource provider validation
> 
>
> Key: MESOS-7558
> URL: https://issues.apache.org/jira/browse/MESOS-7558
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>  Labels: mesosphere, storage
>
> Similar to how it's done during agent registration/re-registration, the 
> information provided by a resource provider needs to be validated during 
> certain operations (e.g. re-registration, while applying offer operations, 
> ...).
> Some of these validations only cover the provided information (e.g. are the 
> resources in {{ResourceProviderInfo}} only of type {{disk}}), others take the 
> current cluster state into account (e.g. do the resources that a task wants 
> to use exist on the resource provider).





[jira] [Updated] (MESOS-7558) Add resource provider validation

2018-01-09 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7558:

Story Points: 3  (was: 2)

> Add resource provider validation
> 
>
> Key: MESOS-7558
> URL: https://issues.apache.org/jira/browse/MESOS-7558
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>  Labels: mesosphere, storage
>
> Similar to how it's done during agent registration/re-registration, the 
> information provided by a resource provider needs to be validated during 
> certain operations (e.g. re-registration, while applying offer operations, 
> ...).
> Some of these validations only cover the provided information (e.g. are the 
> resources in {{ResourceProviderInfo}} only of type {{disk}}), others take the 
> current cluster state into account (e.g. do the resources that a task wants 
> to use exist on the resource provider).





[jira] [Updated] (MESOS-7329) Authorize offer operations for converting disk resources

2018-01-09 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7329:

Story Points: 3

> Authorize offer operations for converting disk resources
> 
>
> Key: MESOS-7329
> URL: https://issues.apache.org/jira/browse/MESOS-7329
> Project: Mesos
>  Issue Type: Task
>  Components: master, security
>Reporter: Jan Schlicht
>  Labels: csi-post-mvp, mesosphere, security, storage
>
> All offer operations are authorized, hence authorization logic has to be 
> added to new offer operations as well.





[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-09 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318666#comment-16318666
 ] 

Andrei Budnik commented on MESOS-7742:
--

Since we launch the 
[`cat`|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/tests/api_tests.cpp#L6529]
 command as a nested container, the related ioswitchboard process will be in 
the same process group. Whenever the process group leader ({{cat}}) terminates, 
all processes in the process group are killed, including the ioswitchboard.
The ioswitchboard handles HTTP requests from the slave, e.g. the 
{{ATTACH_CONTAINER_INPUT}} request in this test.
Usually, after reading all client's data, {{Http::_attachContainerInput()}} 
invokes a callback which calls 
[writer.close()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/http.cpp#L3223].
[writer.close()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L561]
 implies sending a 
[\r\n\r\n|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1045]
 to the ioswitchboard process.
The ioswitchboard returns a [200 
OK|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/containerizer/mesos/io/switchboard.cpp#L1572]
 response, hence the agent returns {{200 OK}} for the 
{{ATTACH_CONTAINER_INPUT}} request as expected.

However, if the ioswitchboard terminates before it receives the {{\r\n\r\n}}, 
or before the agent receives the {{200 OK}} response from the ioswitchboard, 
the connection (via a unix socket) might be closed, so the corresponding 
{{ConnectionProcess}} will handle this case as an unexpected 
[EOF|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1293]
 during the 
[read|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1216]
 of the response. That leads to a {{500 Internal Server Error}} response from 
the agent.

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}





[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-09 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318567#comment-16318567
 ] 

Andrei Budnik commented on MESOS-7742:
--

How to reproduce Flavour 3:
Put a {{::sleep(1);}} before {{writer.close();}} in 
[Http::_attachContainerInput()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/http.cpp#L3222].

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}





[jira] [Updated] (MESOS-8388) Show LRP resources in master endpoints.

2018-01-09 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8388:

Story Points: 2
 Component/s: master

> Show LRP resources in master endpoints.
> ---
>
> Key: MESOS-8388
> URL: https://issues.apache.org/jira/browse/MESOS-8388
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jie Yu
>
> Currently, only the resource provider info is shown. We should also show the 
> resources provided by the RP.





[jira] [Commented] (MESOS-8078) Some fields went missing with no replacement in api/v1

2018-01-09 Thread Dmitrii Rozhkov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318459#comment-16318459
 ] 

Dmitrii Rozhkov commented on MESOS-8078:


Hey [~vinodkone], would you please give an update on the issue? It's kind of 
critical since we're approaching release. Thanks!

> Some fields went missing with no replacement in api/v1
> --
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Story
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>Assignee: Vinod Kone
>Priority: Critical
>  Labels: mesosphere
>
> Hi friends, 
> These fields are available via the state.json but went missing in the v1 of 
> the API:
> -leader_info- -> available via GET_MASTER which should always return leading 
> master info
> start_time
> elected_time
> We're showing them on the Overview page of the DC/OS UI but would like to 
> stop using state.json, so it would be great to have them somewhere in v1.





[jira] [Updated] (MESOS-8078) Some fields went missing with no replacement in api/v1

2018-01-09 Thread Dmitrii Rozhkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitrii Rozhkov updated MESOS-8078:
---
Priority: Critical  (was: Major)

> Some fields went missing with no replacement in api/v1
> --
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Story
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>Assignee: Vinod Kone
>Priority: Critical
>  Labels: mesosphere
>
> Hi friends, 
> These fields are available via the state.json but went missing in the v1 of 
> the API:
> -leader_info- -> available via GET_MASTER which should always return leading 
> master info
> start_time
> elected_time
> We're showing them on the Overview page of the DC/OS UI but would like to 
> stop using state.json, so it would be great to have them somewhere in v1.





[jira] [Updated] (MESOS-8382) Master should bookkeep local resource providers.

2018-01-09 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8382:

Story Points: 5

> Master should bookkeep local resource providers.
> 
>
> Key: MESOS-8382
> URL: https://issues.apache.org/jira/browse/MESOS-8382
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Benjamin Bannier
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> This will simplify the handling of `UpdateSlaveMessage`. Also, it'll simplify 
> the endpoint serving.





[jira] [Updated] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-09 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7506:
-
Attachment: ROOT_IsolatorFlags-badrun2.txt

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, 
> ReconcileTasksMissingFromSlave-badrun.txt, ResourceLimitation-badrun.txt, 
> ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication
> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags
> {noformat}





[jira] [Commented] (MESOS-8419) RP manager incorrectly setting framework ID leads to CHECK failure

2018-01-09 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318132#comment-16318132
 ] 

Greg Mann commented on MESOS-8419:
--

Review here: https://reviews.apache.org/r/65034/

> RP manager incorrectly setting framework ID leads to CHECK failure
> --
>
> Key: MESOS-8419
> URL: https://issues.apache.org/jira/browse/MESOS-8419
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Greg Mann
>Priority: Blocker
>  Labels: mesosphere
>
> The resource provider manager [unconditionally sets the framework 
> ID|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/manager.cpp#L637]
>  when forwarding operation status updates to the agent. This is incorrect, 
> for example, when the resource provider [generates OPERATION_DROPPED updates 
> during 
> reconciliation|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/storage/provider.cpp#L1653-L1657],
>  and leads to protobuf errors in this case since the framework ID's required 
> {{value}} field is left unset.





[jira] [Updated] (MESOS-8419) RP manager incorrectly setting framework ID leads to CHECK failure

2018-01-09 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-8419:
-
  Sprint: Mesosphere Sprint 72
Story Points: 1
  Labels: mesosphere  (was: )
Priority: Blocker  (was: Major)

> RP manager incorrectly setting framework ID leads to CHECK failure
> --
>
> Key: MESOS-8419
> URL: https://issues.apache.org/jira/browse/MESOS-8419
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Greg Mann
>Priority: Blocker
>  Labels: mesosphere
>
> The resource provider manager [unconditionally sets the framework 
> ID|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/manager.cpp#L637]
>  when forwarding operation status updates to the agent. This is incorrect, 
> for example, when the resource provider [generates OPERATION_DROPPED updates 
> during 
> reconciliation|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/storage/provider.cpp#L1653-L1657],
>  and leads to protobuf errors in this case since the framework ID's required 
> {{value}} field is left unset.





[jira] [Created] (MESOS-8419) RP manager incorrectly setting framework ID leads to CHECK failure

2018-01-09 Thread Greg Mann (JIRA)
Greg Mann created MESOS-8419:


 Summary: RP manager incorrectly setting framework ID leads to 
CHECK failure
 Key: MESOS-8419
 URL: https://issues.apache.org/jira/browse/MESOS-8419
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: Greg Mann


The resource provider manager [unconditionally sets the framework 
ID|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/manager.cpp#L637]
 when forwarding operation status updates to the agent. This is incorrect when, 
for example, the resource provider [generates OPERATION_DROPPED updates during 
reconciliation|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/storage/provider.cpp#L1653-L1657],
 and it leads to protobuf errors because the framework ID's required 
{{value}} field is left unset.
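To illustrate the bug, here is a minimal Python stand-in for the protobuf messages involved (this is not Mesos code; `forward`, `OperationStatusUpdate`, and `AgentMessage` are hypothetical names). The point is that the framework ID must be copied only when the incoming update actually carries one, since a reconciliation-generated OPERATION_DROPPED update has none:

```python
# Hypothetical sketch, not actual Mesos code: dataclasses stand in for the
# protobuf messages, with None modeling an unset optional field.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FrameworkID:
    value: str  # required in the real proto

@dataclass
class OperationStatusUpdate:
    framework_id: Optional[FrameworkID] = None  # absent for RP-internal updates

@dataclass
class AgentMessage:
    framework_id: Optional[FrameworkID] = None

def forward(update: OperationStatusUpdate) -> AgentMessage:
    message = AgentMessage()
    # Guard the copy: unconditionally setting the field for an update that
    # has no framework ID leaves the required `value` subfield empty, which
    # is what trips protobuf validation in the reported bug.
    if update.framework_id is not None:
        message.framework_id = update.framework_id
    return message

# A framework-originated update keeps its ID...
assert forward(OperationStatusUpdate(FrameworkID("fw-1"))).framework_id.value == "fw-1"
# ...while a reconciliation-generated OPERATION_DROPPED leaves it unset.
assert forward(OperationStatusUpdate()).framework_id is None
```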





[jira] [Created] (MESOS-8418) mesos-agent high cpu usage because of numerous /proc/mounts reads

2018-01-09 Thread Stéphane Cottin (JIRA)
Stéphane Cottin created MESOS-8418:
--

 Summary: mesos-agent high cpu usage because of numerous 
/proc/mounts reads
 Key: MESOS-8418
 URL: https://issues.apache.org/jira/browse/MESOS-8418
 Project: Mesos
  Issue Type: Improvement
  Components: agent, cgroups
Reporter: Stéphane Cottin


/proc/mounts is read many, many times from 
src/(linux/fs|linux/cgroups|slave/slave).cpp.

When using overlayfs, the /proc/mounts contents can become quite large. 
For example, one of our single-node Q/A hosts running ~150 tasks has a 
/proc/mounts file of 361 lines / 201,299 characters.

This 200 kB file is read on this node about 25 to 150 times per second, which 
is a huge waste of CPU and I/O time.

Most of these calls are related to cgroups.

Please consider these proposals:

1/ Is /proc/mounts mandatory for cgroups?
We already have the cgroup subsystem list from /proc/cgroups.
The only essential information from /proc/mounts seems to be the root mount 
point, /sys/fs/cgroup/, which could be obtained with a single read at agent 
start.

2/ use /proc/self/mountstats

{noformat}
wc /proc/self/mounts /proc/self/mountstats
361 2166 201299 /proc/self/mounts
361 2888 50200 /proc/self/mountstats
{noformat}

{noformat}
grep cgroup /proc/self/mounts
cgroup /sys/fs/cgroup tmpfs rw,relatime,mode=755 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/net_prio cgroup rw,relatime,net_prio 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,relatime,pids 0 0
{noformat}

{noformat}
grep cgroup /proc/self/mountstats
device cgroup mounted on /sys/fs/cgroup with fstype tmpfs
device cgroup mounted on /sys/fs/cgroup/cpuset with fstype cgroup
device cgroup mounted on /sys/fs/cgroup/cpu with fstype cgroup
device cgroup mounted on /sys/fs/cgroup/cpuacct with fstype cgroup
device cgroup mounted on /sys/fs/cgroup/blkio with fstype cgroup
device cgroup mounted on /sys/fs/cgroup/memory with fstype cgroup
device cgroup mounted on /sys/fs/cgroup/devices with fstype cgroup
device cgroup mounted on /sys/fs/cgroup/freezer with fstype cgroup
device cgroup mounted on /sys/fs/cgroup/net_cls with fstype cgroup
device cgroup mounted on /sys/fs/cgroup/perf_event with fstype cgroup
device cgroup mounted on /sys/fs/cgroup/net_prio with fstype cgroup
device cgroup mounted on /sys/fs/cgroup/pids with fstype cgroup
{noformat}

This file contains all the required information, and is 4x smaller.
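A sketch of how the per-subsystem cgroup mount points could be recovered from /proc/self/mountstats output (the helper name is illustrative, not a proposed Mesos API; the line format assumed is the one shown above, "device <dev> mounted on <mount point> with fstype <type>"):

```python
# Hypothetical helper: map mount point -> fstype for every cgroup device
# line in /proc/self/mountstats-style text.
def cgroup_mounts(mountstats_text):
    mounts = {}
    for line in mountstats_text.splitlines():
        parts = line.split()
        # "device cgroup mounted on /sys/fs/cgroup/cpuset with fstype cgroup"
        if len(parts) >= 8 and parts[0] == "device" and parts[1] == "cgroup":
            mounts[parts[4]] = parts[7]
    return mounts

sample = """\
device cgroup mounted on /sys/fs/cgroup with fstype tmpfs
device cgroup mounted on /sys/fs/cgroup/cpuset with fstype cgroup
device cgroup mounted on /sys/fs/cgroup/memory with fstype cgroup
"""

assert cgroup_mounts(sample)["/sys/fs/cgroup"] == "tmpfs"
assert cgroup_mounts(sample)["/sys/fs/cgroup/memory"] == "cgroup"
```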

3/ Microcaching
Caching cgroup data for just one second would be a huge performance 
improvement, but I'm not aware of the possible side effects.
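A rough sketch of the microcaching idea (class name and structure are illustrative only; a real implementation would have to weigh the one-second staleness window against mounts appearing or disappearing under the agent):

```python
import os
import tempfile
import time

class MicroCache:
    """Cache a file's contents for a short TTL so repeated cgroup lookups
    don't reread /proc/self/mounts on every call. Sketch only."""

    def __init__(self, path, ttl=1.0, now=time.monotonic):
        self.path = path
        self.ttl = ttl
        self.now = now  # injectable clock, handy for testing
        self._stamp = None
        self._data = None

    def read(self):
        t = self.now()
        if self._stamp is None or t - self._stamp >= self.ttl:
            with open(self.path) as f:
                self._data = f.read()
            self._stamp = t
        return self._data

# Demo with a fake clock and a temp file standing in for /proc/self/mounts.
clock = [0.0]
with tempfile.NamedTemporaryFile("w", delete=False) as f:
    f.write("old mounts")
    path = f.name
cache = MicroCache(path, ttl=1.0, now=lambda: clock[0])
assert cache.read() == "old mounts"
with open(path, "w") as f:
    f.write("new mounts")
assert cache.read() == "old mounts"  # within the TTL: served from cache
clock[0] = 1.5
assert cache.read() == "new mounts"  # TTL expired: file reread
os.unlink(path)
```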





[jira] [Commented] (MESOS-8373) Test reconciliation after operation is dropped en route to agent

2018-01-09 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318110#comment-16318110
 ] 

Greg Mann commented on MESOS-8373:
--

Review here: https://reviews.apache.org/r/65039/

> Test reconciliation after operation is dropped en route to agent
> 
>
> Key: MESOS-8373
> URL: https://issues.apache.org/jira/browse/MESOS-8373
> Project: Mesos
>  Issue Type: Task
>  Components: agent, master
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
>
> Since new code paths were added to handle operations on resources in 1.5, we 
> should test that such operations are reconciled correctly after an operation 
> is dropped on the way from the master to the agent.


