[jira] [Commented] (MESOS-8125) Agent should properly handle recovering an executor when its pid is reused
[ https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319818#comment-16319818 ] Yan Xu commented on MESOS-8125: --- We used to not need to handle recovering executors after a reboot because the agent would have been considered lost, so not only did we not need to recover the executors, we also didn't need to resume unacknowledged status updates etc. In the new scenario we need to handle these so we cannot just simply remove the {{latest}} executor run symlink. I guess we should just short circuit the executor reconnect/reregister logic based on the {{rebooted}} field in the top-level {{State}} but keep the rest of the recovery logic. > Agent should properly handle recovering an executor when its pid is reused > -- > > Key: MESOS-8125 > URL: https://issues.apache.org/jira/browse/MESOS-8125 > Project: Mesos > Issue Type: Bug >Reporter: Gastón Kleiman >Assignee: Megha Sharma >Priority: Critical > > We know that all executors will be gone once the host on which an agent is > running is rebooted, so there's no need to try to recover these executors. > Trying to recover stopped executors can lead to problems if another process > is assigned the same pid that the executor had before the reboot. In this > case the agent will unsuccessfully try to reregister with the executor, and > then transition it to a {{TERMINATING}} state. The executor will sadly get > stuck in that state, and the tasks that it started will get stuck in whatever > state they were in at the time of the reboot. > One way of getting rid of stuck executors is to remove the {{latest}} symlink > under {{work_dir/meta/slaves/latest/frameworks/<framework id>/executors/<executor id>/runs}}. > Here's how to reproduce this issue: > # Start a task using the Docker containerizer (the same will probably happen > with the command executor). > # Stop the corresponding Mesos agent while the task is running. 
> # Change the executor's checkpointed forked pid, which is located in the meta > directory, e.g., > {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}. > I used pid 2, which is normally used by {{kthreadd}}. > # Reboot the host -- This message was sent by Atlassian JIRA (v6.4.14#64029)
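The short circuit Yan Xu suggests above can be sketched as follows. This is a hypothetical, simplified model in C++, not the actual Mesos agent code; the type and function names (`State::rebooted` aside, which the comment names) are illustrative:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Simplified stand-ins for the checkpointed recovery state. Only the
// `rebooted` field mirrors the comment above; everything else is invented
// for illustration.
struct ExecutorRunState {
  std::string executorId;
  std::optional<int> forkedPid;  // checkpointed forked.pid, if any
};

struct State {
  bool rebooted;  // true when the host rebooted since the checkpoint
};

enum class RecoveryAction { Reconnect, MarkCompleted };

// Decide, per executor run, what recovery should do.
RecoveryAction recoverExecutor(const State& state, const ExecutorRunState& run) {
  // After a reboot every pre-reboot executor is gone, so attempting to
  // reconnect risks talking to an unrelated process that reused the pid.
  if (state.rebooted) {
    return RecoveryAction::MarkCompleted;
  }
  // No reboot: a checkpointed pid may still belong to the executor.
  if (run.forkedPid.has_value()) {
    return RecoveryAction::Reconnect;
  }
  return RecoveryAction::MarkCompleted;
}
```

The point of the sketch: the reboot check happens before any pid-based reconnect attempt, so a reused pid (like pid 2 in the reproduction steps) is never contacted.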
[jira] [Commented] (MESOS-8413) Zookeeper configuration passwords are shown in clear text
[ https://issues.apache.org/jira/browse/MESOS-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319625#comment-16319625 ] James Peach commented on MESOS-8413: There's a similar issue with URLs for the {{CommandInfo.URI}} message. IIRC when I looked into that, the problem was that there was no code to crack the credentials out of the URL, so it wasn't even clear that the URL credentials didn't just happen to work by accident. These passwords end up in log files. > Zookeeper configuration passwords are shown in clear text > - > > Key: MESOS-8413 > URL: https://issues.apache.org/jira/browse/MESOS-8413 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.4.1 >Reporter: Alexander Rojas >Assignee: Alexander Rojas > Labels: mesosphere, security > > No matter how one configures mesos, either by passing the ZooKeeper flags in > the command line or using a file, as follows: > {noformat} > ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master > --log_dir=/tmp/$USER/mesos/master/log > --zk=zk://${zk_username}:${zk_password}@${zk_addr}/mesos --quorum=1 > {noformat} > {noformat} > echo "zk://${zk_username}:${zk_password}@${zk_addr}/mesos" > > /tmp/${USER}/mesos/zk_config.txt > ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master > --log_dir=/tmp/$USER/mesos/master/log --zk=/tmp/${USER}/mesos/zk_config.txt > {noformat} > both the logs and the results of the {{/flags}} endpoint will resolve to the > contents of the flags, i.e.: > {noformat} > I0108 10:12:50.387522 28579 master.cpp:458] Flags at startup: > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="false" --authenticate_frameworks="false" > --authenticate_http_frameworks="false" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticators="crammd5" > --authorizers="local" --filter_gpu_resources="true" --framework_sorter="drf" > 
--help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --log_dir="/tmp/user/mesos/master/log" --logbufsecs="0" > --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --quorum="1" --recovery_agent_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" > --registry_max_agent_count="102400" --registry_store_timeout="20secs" > --registry_strict="false" --require_agent_domain="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/home/user/mesos/build/../src/webui" > --work_dir="/tmp/user/mesos/master" > --zk="zk://user@passwd:127.0.0.1:2181/mesos" --zk_session_timeout="10secs" > {noformat} > {noformat} > HTTP/1.1 200 OK > Content-Encoding: gzip > Content-Length: 591 > Content-Type: application/json > Date: Mon, 08 Jan 2018 15:12:53 GMT > { > "flags": { > "agent_ping_timeout": "15secs", > "agent_reregister_timeout": "10mins", > "allocation_interval": "1secs", > "allocator": "HierarchicalDRF", > "authenticate_agents": "false", > "authenticate_frameworks": "false", > "authenticate_http_frameworks": "false", > "authenticate_http_readonly": "false", > "authenticate_http_readwrite": "false", > "authenticators": "crammd5", > "authorizers": "local", > "filter_gpu_resources": "true", > "framework_sorter": "drf", > "help": "false", > "hostname_lookup": "true", > "http_authenticators": "basic", > "initialize_driver_logging": "true", > "log_auto_initialize": "true", > "log_dir": "/tmp/user/mesos/master/log", > "logbufsecs": "0", > "logging_level": "INFO", > "max_agent_ping_timeouts": "5", > "max_completed_frameworks": "50", > "max_completed_tasks_per_framework": "1000", > "max_unreachable_tasks_per_framework": 
"1000", > "port": "5050", > "quiet": "false", > "quorum": "1", > "recovery_agent_removal_limit": "100%", > "registry": "replicated_log", > "registry_fetch_timeout": "1mins", > "registry_gc_interval": "15mins", > "registry_max_agent_age": "2weeks", > "registry_max_agent_count": "102400", > "registry_store_timeout": "20secs", > "registry_strict": "false", > "require_agent_domain": "false", > "root_submissions": "true", > "user_sorter": "drf", >
[jira] [Updated] (MESOS-8422) Master's UpdateSlave handler not correctly updating terminated operations
[ https://issues.apache.org/jira/browse/MESOS-8422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gastón Kleiman updated MESOS-8422: -- Summary: Master's UpdateSlave handler not correctly updating terminated operations (was: Master's UpdateSlave handler not correctly updating operations) > Master's UpdateSlave handler not correctly updating terminated operations > - > > Key: MESOS-8422 > URL: https://issues.apache.org/jira/browse/MESOS-8422 > Project: Mesos > Issue Type: Bug >Reporter: Gastón Kleiman > Labels: mesosphere > > I created a test that verifies that operation status updates are resent to > the master after being dropped en route to it (MESOS-8420). > The test does the following: > # Creates a volume from a RAW disk resource. > # Drops the first `UpdateOperationStatusMessage` message from the agent to > the master, so that it isn't acknowledged by the master. > # Restarts the agent. > # Verifies that the agent resends the operation status update. > The good news is that the agent is resending the operation status update; > the bad news is that it triggers a CHECK failure that crashes the master. > Here are the relevant sections of the log produced by the test: > {noformat} > [ RUN ] > StorageLocalResourceProviderTest.ROOT_RetryOperationStatusUpdateAfterRecovery > [...] 
> I0109 16:36:08.515882 24106 master.cpp:4284] Processing ACCEPT call for > offers: [ 046b3f21-6e97-4a56-9a13-773f7d481efd-O0 ] on agent > 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 > (core-dev) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd- (default) > at scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681 > I0109 16:36:08.516487 24106 master.cpp:5260] Processing CREATE_VOLUME > operation with source disk(allocated: storage)(reservations: > [(DYNAMIC,storage)])[RAW(,volume-default)]:4096 from framework > 046b3f21-6e97-4a56-9a13-773f7d481efd- (default) at > scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681 to agent > 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) > I0109 16:36:08.518704 24106 master.cpp:10622] Sending operation '' (uuid: > 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) to agent > 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) > I0109 16:36:08.521210 24130 provider.cpp:504] Received APPLY_OPERATION event > I0109 16:36:08.521276 24130 provider.cpp:1368] Received CREATE_VOLUME > operation '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) > I0109 16:36:08.523131 24432 test_csi_plugin.cpp:305] CreateVolumeRequest > '{"version":{"minor":1},"name":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","capacityRange":{"requiredBytes":"4294967296","limitBytes":"4294967296"},"volumeCapabilities":[{"mount":{},"accessMode":{"mode":"SINGLE_NODE_WRITER"}}]}' > I0109 16:36:08.525806 24152 provider.cpp:2635] Applying conversion from > 'disk(allocated: storage)(reservations: > [(DYNAMIC,storage)])[RAW(,volume-default)]:4096' to 'disk(allocated: > storage)(reservations: > [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096' > for operation (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) > I0109 16:36:08.528725 24134 status_update_manager_process.hpp:152] 
Received > operation status update OPERATION_FINISHED (Status UUID: > 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID > 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework > '046b3f21-6e97-4a56-9a13-773f7d481efd-' on agent > 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 > I0109 16:36:08.529207 24134 status_update_manager_process.hpp:929] > Checkpointing UPDATE for operation status update OPERATION_FINISHED (Status > UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID > 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework > '046b3f21-6e97-4a56-9a13-773f7d481efd-' on agent > 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 > I0109 16:36:08.573177 24150 http.cpp:1185] HTTP POST for > /slave(2)/api/v1/resource_provider from 10.0.49.2:53598 > I0109 16:36:08.573974 24139 slave.cpp:7065] Handling resource provider > message 'UPDATE_OPERATION_STATUS: (uuid: > 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework > 046b3f21-6e97-4a56-9a13-773f7d481efd- (latest state: OPERATION_FINISHED, > status update state: OPERATION_FINISHED)' > I0109 16:36:08.574154 24139 slave.cpp:7409] Updating the state of operation ' > with no ID (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework > 046b3f21-6e97-4a56-9a13-773f7d481efd- (latest state: OPERATION_FINISHED, > status update state: OPERATION_FINISHED) > I0109 16:36:08.574785 24139 slave.cpp:7249] Forwarding status update of > operation with no ID
[jira] [Updated] (MESOS-8422) Master's UpdateSlave handler not correctly updating operations
[ https://issues.apache.org/jira/browse/MESOS-8422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gastón Kleiman updated MESOS-8422: -- Description: I created a test that verifies that operation status updates are resent to the master after being dropped en route to it (MESOS-8420). The test does the following: # Creates a volume from a RAW disk resource. # Drops the first `UpdateOperationStatusMessage` message from the agent to the master, so that it isn't acknowledged by the master. # Restarts the agent. # Verifies that the agent resends the operation status update. The good news are that the agent is resending the operation status update, the bad news are that it triggers a CHECK failure that crashes the master. Here are the relevant sections of the log produced by the test: {noformat} [ RUN ] StorageLocalResourceProviderTest.ROOT_RetryOperationStatusUpdateAfterRecovery [...] I0109 16:36:08.515882 24106 master.cpp:4284] Processing ACCEPT call for offers: [ 046b3f21-6e97-4a56-9a13-773f7d481efd-O0 ] on agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd- (default) at scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681 I0109 16:36:08.516487 24106 master.cpp:5260] Processing CREATE_VOLUME operation with source disk(allocated: storage)(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096 from framework 046b3f21-6e97-4a56-9a13-773f7d481efd- (default) at scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681 to agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) I0109 16:36:08.518704 24106 master.cpp:10622] Sending operation '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) to agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) I0109 16:36:08.521210 24130 provider.cpp:504] Received APPLY_OPERATION event I0109 16:36:08.521276 24130 provider.cpp:1368] Received CREATE_VOLUME 
operation '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) I0109 16:36:08.523131 24432 test_csi_plugin.cpp:305] CreateVolumeRequest '{"version":{"minor":1},"name":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","capacityRange":{"requiredBytes":"4294967296","limitBytes":"4294967296"},"volumeCapabilities":[{"mount":{},"accessMode":{"mode":"SINGLE_NODE_WRITER"}}]}' I0109 16:36:08.525806 24152 provider.cpp:2635] Applying conversion from 'disk(allocated: storage)(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096' to 'disk(allocated: storage)(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096' for operation (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) I0109 16:36:08.528725 24134 status_update_manager_process.hpp:152] Received operation status update OPERATION_FINISHED (Status UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework '046b3f21-6e97-4a56-9a13-773f7d481efd-' on agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 I0109 16:36:08.529207 24134 status_update_manager_process.hpp:929] Checkpointing UPDATE for operation status update OPERATION_FINISHED (Status UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework '046b3f21-6e97-4a56-9a13-773f7d481efd-' on agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 I0109 16:36:08.573177 24150 http.cpp:1185] HTTP POST for /slave(2)/api/v1/resource_provider from 10.0.49.2:53598 I0109 16:36:08.573974 24139 slave.cpp:7065] Handling resource provider message 'UPDATE_OPERATION_STATUS: (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd- (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)' I0109 16:36:08.574154 24139 slave.cpp:7409] Updating the state of operation ' with no ID (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for 
framework 046b3f21-6e97-4a56-9a13-773f7d481efd- (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED) I0109 16:36:08.574785 24139 slave.cpp:7249] Forwarding status update of operation with no ID (operation_uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd- I0109 16:36:08.583748 24084 slave.cpp:931] Agent terminating I0109 16:36:08.584115 24144 master.cpp:1305] Agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) disconnected [...] I0109 16:36:08.655766 24140 slave.cpp:1378] Re-registered with master master@10.0.49.2:40681 I0109 16:36:08.655936 24117 task_status_update_manager.cpp:188] Resuming sending task status updates I0109 16:36:08.655995 24149 hierarchical.cpp:669] Agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total resources cpus:2;
[jira] [Created] (MESOS-8422) Master's UpdateSlave handler not correctly updating operations
Gastón Kleiman created MESOS-8422: - Summary: Master's UpdateSlave handler not correctly updating operations Key: MESOS-8422 URL: https://issues.apache.org/jira/browse/MESOS-8422 Project: Mesos Issue Type: Bug Reporter: Gastón Kleiman
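The retry behavior the MESOS-8420 test exercises — keep resending a checkpointed operation status update until the master acknowledges it — can be modeled with a minimal sketch. This is an illustration of the at-least-once scheme, not the actual Mesos status update manager:

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy model: updates are queued (standing in for the checkpointed stream)
// and the front of the queue is resent on every timeout or agent restart
// until the master acknowledges it. A dropped message therefore only
// delays delivery; it never loses the update.
class OperationStatusUpdateQueue {
public:
  void enqueue(const std::string& update) { pending_.push_back(update); }

  // The update to (re)send next, if any.
  const std::string* next() const {
    return pending_.empty() ? nullptr : &pending_.front();
  }

  // The master acknowledged the in-flight update; retire it.
  void acknowledge() {
    if (!pending_.empty()) {
      pending_.pop_front();
    }
  }

private:
  std::deque<std::string> pending_;
};
```

Because a resend after recovery is expected behavior, the master's handlers must tolerate receiving an update for an operation it has already marked terminal — which is exactly the CHECK failure this ticket tracks.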
[jira] [Created] (MESOS-8421) Duration operators drop precision, even when used with integers
Andrew Schwartzmeyer created MESOS-8421: --- Summary: Duration operators drop precision, even when used with integers Key: MESOS-8421 URL: https://issues.apache.org/jira/browse/MESOS-8421 Project: Mesos Issue Type: Improvement Components: stout Reporter: Andrew Schwartzmeyer Priority: Minor The implementation of {{Duration operator*=()}} is as follows: {noformat} Duration& operator*=(double multiplier) { nanos = static_cast<int64_t>(nanos * multiplier); return *this; } {noformat} A similar pattern is implemented for all the operators. This means that, even when multiplying by {{int64_t}} (underlying type of {{nanos}}), we lose precision. While [Review #64729|https://reviews.apache.org/r/64729/] removes the conversion warnings from {{int}} and {{size_t}} to {{double}}, it purposefully does not address fixing the precision of these operators (as that'll be a change in behavior, albeit slight, and should be done for the whole class at once).
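The precision loss is easy to demonstrate: once the {{int64_t}} operand is converted to {{double}}, nanosecond counts above 2^53 are rounded to the double's 53-bit significand before the multiplication happens. A small sketch mimicking the pattern (function names are illustrative, not from stout):

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the operator*= pattern from the ticket: the int64_t nanosecond
// count is implicitly converted to double before multiplying, losing any
// bits beyond the 53-bit significand.
int64_t multiplyViaDouble(int64_t nanos, double multiplier) {
  return static_cast<int64_t>(nanos * multiplier);
}

// Integer overload: stays in 64-bit integer arithmetic, no rounding.
int64_t multiplyExact(int64_t nanos, int64_t multiplier) {
  return nanos * multiplier;
}
```

For example, 2^53 + 1 rounds to 2^53 when converted to double, so multiplying it by 2 through the double path yields a result that is off by 2 from the exact integer product.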
[jira] [Commented] (MESOS-8078) Some fields went missing with no replacement in api/v1
[ https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319427#comment-16319427 ] Greg Mann commented on MESOS-8078: -- There are some similarly missing fields in the agent operator API. I'll follow up with a patch for those shortly. > Some fields went missing with no replacement in api/v1 > -- > > Key: MESOS-8078 > URL: https://issues.apache.org/jira/browse/MESOS-8078 > Project: Mesos > Issue Type: Story > Components: HTTP API >Reporter: Dmitrii Rozhkov >Assignee: Greg Mann >Priority: Critical > Labels: mesosphere > > Hi friends, > These fields are available via the state.json but went missing in the v1 of > the API: > -leader_info- -> available via GET_MASTER which should always return leading > master info > start_time > elected_time > As we're showing them on the Overview page of the DC/OS UI, yet would like > not to be using state.json, it would be great to have them somewhere in V1.
[jira] [Commented] (MESOS-8078) Some fields went missing with no replacement in api/v1
[ https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319425#comment-16319425 ] Greg Mann commented on MESOS-8078: -- Review here: https://reviews.apache.org/r/65056/ > Some fields went missing with no replacement in api/v1 > -- > > Key: MESOS-8078 > URL: https://issues.apache.org/jira/browse/MESOS-8078 > Project: Mesos > Issue Type: Story > Components: HTTP API >Reporter: Dmitrii Rozhkov >Assignee: Greg Mann >Priority: Critical > Labels: mesosphere > > Hi friends, > These fields are available via the state.json but went missing in the v1 of > the API: > -leader_info- -> available via GET_MASTER which should always return leading > master info > start_time > elected_time > As we're showing them on the Overview page of the DC/OS UI, yet would like > not to be using state.json, it would be great to have them somewhere in V1.
[jira] [Updated] (MESOS-8413) Zookeeper configuration passwords are shown in clear text
[ https://issues.apache.org/jira/browse/MESOS-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-8413: - Shepherd: Greg Mann > Zookeeper configuration passwords are shown in clear text > - > > Key: MESOS-8413 > URL: https://issues.apache.org/jira/browse/MESOS-8413 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.4.1 >Reporter: Alexander Rojas >Assignee: Alexander Rojas > Labels: mesosphere, security > > No matter how one configures mesos, either by passing the ZooKeeper flags in > the command line or using a file, as follows: > {noformat} > ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master > --log_dir=/tmp/$USER/mesos/master/log > --zk=zk://${zk_username}:${zk_password}@${zk_addr}/mesos --quorum=1 > {noformat} > {noformat} > echo "zk://${zk_username}:${zk_password}@${zk_addr}/mesos" > > /tmp/${USER}/mesos/zk_config.txt > ./bin/mesos-master.sh --work_dir=/tmp/$USER/mesos/master > --log_dir=/tmp/$USER/mesos/master/log --zk=/tmp/${USER}/mesos/zk_config.txt > {noformat} > both the logs and the results of the {{/flags}} endpoint will resolve to the > contents of the flags, i.e.: > {noformat} > I0108 10:12:50.387522 28579 master.cpp:458] Flags at startup: > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="false" --authenticate_frameworks="false" > --authenticate_http_frameworks="false" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticators="crammd5" > --authorizers="local" --filter_gpu_resources="true" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --log_dir="/tmp/user/mesos/master/log" --logbufsecs="0" > --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --quorum="1" --recovery_agent_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" > --registry_max_agent_count="102400" --registry_store_timeout="20secs" > --registry_strict="false" --require_agent_domain="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/home/user/mesos/build/../src/webui" > --work_dir="/tmp/user/mesos/master" > --zk="zk://user@passwd:127.0.0.1:2181/mesos" --zk_session_timeout="10secs" > {noformat} > {noformat} > HTTP/1.1 200 OK > Content-Encoding: gzip > Content-Length: 591 > Content-Type: application/json > Date: Mon, 08 Jan 2018 15:12:53 GMT > { > "flags": { > "agent_ping_timeout": "15secs", > "agent_reregister_timeout": "10mins", > "allocation_interval": "1secs", > "allocator": "HierarchicalDRF", > "authenticate_agents": "false", > "authenticate_frameworks": "false", > "authenticate_http_frameworks": "false", > "authenticate_http_readonly": "false", > "authenticate_http_readwrite": "false", > "authenticators": "crammd5", > "authorizers": "local", > "filter_gpu_resources": "true", > "framework_sorter": "drf", > "help": "false", > "hostname_lookup": "true", > "http_authenticators": "basic", > "initialize_driver_logging": "true", > "log_auto_initialize": "true", > "log_dir": "/tmp/user/mesos/master/log", > "logbufsecs": "0", > "logging_level": "INFO", > "max_agent_ping_timeouts": "5", > "max_completed_frameworks": "50", > "max_completed_tasks_per_framework": "1000", > "max_unreachable_tasks_per_framework": "1000", > "port": "5050", > "quiet": "false", > "quorum": "1", > "recovery_agent_removal_limit": "100%", > "registry": "replicated_log", > "registry_fetch_timeout": "1mins", > "registry_gc_interval": "15mins", > "registry_max_agent_age": "2weeks", > "registry_max_agent_count": "102400", > "registry_store_timeout": "20secs", > 
"registry_strict": "false", > "require_agent_domain": "false", > "root_submissions": "true", > "user_sorter": "drf", > "version": "false", > "webui_dir": "/home/user/mesos/build/../src/webui", > "work_dir": "/tmp/user/mesos/master", > "zk": "zk://user@passwd:127.0.0.1:2181/mesos", > "zk_session_timeout": "10secs" > } > } > {noformat} > Which leads to having no effective way to prevent the passwords to be shown >
[jira] [Updated] (MESOS-8420) Test that operation status updates are retried after being dropped en-route to the master.
[ https://issues.apache.org/jira/browse/MESOS-8420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gastón Kleiman updated MESOS-8420: -- Summary: Test that operation status updates are retried after being dropped en-route to the master. (was: Verify end-to-end operation status update) > Test that operation status updates are retried after being dropped en-route > to the master. > -- > > Key: MESOS-8420 > URL: https://issues.apache.org/jira/browse/MESOS-8420 > Project: Mesos > Issue Type: Task >Reporter: Gastón Kleiman >Assignee: Gastón Kleiman > Labels: mesosphere >
[jira] [Created] (MESOS-8420) Verify end-to-end operation status update
Gastón Kleiman created MESOS-8420: - Summary: Verify end-to-end operation status update Key: MESOS-8420 URL: https://issues.apache.org/jira/browse/MESOS-8420 Project: Mesos Issue Type: Task Reporter: Gastón Kleiman Assignee: Gastón Kleiman
[jira] [Updated] (MESOS-7803) fs::list drops path components on Windows
[ https://issues.apache.org/jira/browse/MESOS-7803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Schwartzmeyer updated MESOS-7803: Priority: Major (was: Minor) > fs::list drops path components on Windows > - > > Key: MESOS-7803 > URL: https://issues.apache.org/jira/browse/MESOS-7803 > Project: Mesos > Issue Type: Bug > Components: stout > Environment: Windows 10 >Reporter: Andrew Schwartzmeyer >Assignee: Andrew Schwartzmeyer > Labels: windows > > fs::list(/foo/bar/*.txt) returns a.txt, b.txt, not /foo/bar/a.txt, > /foo/bar/b.txt > This breaks a ZooKeeper test on Windows.
[jira] [Updated] (MESOS-8224) mesos.interface 1.4.0 cannot be installed with pip
[ https://issues.apache.org/jira/browse/MESOS-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kapil Arya updated MESOS-8224: -- Story Points: 1 Sprint: Mesosphere Sprint 72 Packages available at: https://pypi.python.org/pypi/mesos.interface/1.4.0 The following succeeds now: {code} python -m pip install --user mesos.interface==1.4.0 {code} > mesos.interface 1.4.0 cannot be installed with pip > -- > > Key: MESOS-8224 > URL: https://issues.apache.org/jira/browse/MESOS-8224 > Project: Mesos > Issue Type: Task > Components: release >Reporter: Bill Farner > > This breaks some framework development tooling. > WIth latest pip: > {noformat} > $ python -m pip -V > pip 9.0.1 from > /Users/wfarner/code/aurora/build-support/python/pycharm.venv/lib/python2.7/site-packages > (python 2.7) > {noformat} > This works fine for previous releases: > {noformat} > $ python -m pip install mesos.interface==1.3.0 > Collecting mesos.interface==1.3.0 > ... > Installing collected packages: mesos.interface > Successfully installed mesos.interface-1.3.0 > {noformat} > But it does not for 1.4.0: > {noformat} > $ python -m pip install mesos.interface==1.4.0 > Collecting mesos.interface==1.4.0 > Could not find a version that satisfies the requirement > mesos.interface==1.4.0 (from versions: 0.21.2.linux-x86_64, > 0.22.1.2.linux-x86_64, 0.22.2.linux-x86_64, 0.23.1.linux-x86_64, > 0.24.1.linux-x86_64, 0.24.2.linux-x86_64, 0.25.0.linux-x86_64, > 0.25.1.linux-x86_64, 0.26.1.linux-x86_64, 0.27.0.linux-x86_64, > 0.27.1.linux-x86_64, 0.27.2.linux-x86_64, 0.28.0.linux-x86_64, > 0.28.1.linux-x86_64, 0.28.2.linux-x86_64, 1.0.0.linux-x86_64, > 1.0.1.linux-x86_64, 1.1.0.linux-x86_64, 1.2.0.linux-x86_64, > 1.3.0.linux-x86_64, 0.20.0, 0.20.1, 0.21.0, 0.21.1, 0.21.2, 0.22.0, 0.22.1.2, > 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.26.0, > 0.26.1, 0.27.0, 0.27.1, 0.27.2, 0.28.0, 0.28.1, 0.28.2, 1.0.0, 1.0.1, 1.1.0, > 1.2.0, 1.3.0) > No matching distribution found for 
mesos.interface==1.4.0 > {noformat} > Verbose output shows that pip skips the 1.4.0 distribution: > {noformat} > $ python -m pip install -v mesos.interface==1.4.0 | grep 1.4.0 > Collecting mesos.interface==1.4.0 > Skipping link > https://pypi.python.org/packages/ef/1b/d5b0c1456f755ad42477eaa9667e22d1f5fd8e2fce0f9b26937f93743f6c/mesos.interface-1.4.0-py2.7.egg#md5=32113860961d49c31f69f7b13a9bc063 > (from https://pypi.python.org/simple/mesos-interface/); unsupported archive > format: .egg > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
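The verbose output above boils down to a simple rule: pip 9.x cannot install {{.egg}} archives, so a release uploaded to PyPI only as an egg is invisible to {{pip install}}. A minimal Python sketch of that filtering step (illustrative names only, not pip's actual internals):

```python
# Hypothetical sketch of pip's candidate filtering: links whose archive
# format pip cannot install (.egg) are skipped, so if a release was only
# uploaded as an egg, no installable distribution remains for it.
SUPPORTED_EXTENSIONS = (".tar.gz", ".tar.bz2", ".zip", ".whl")

def installable(filename: str) -> bool:
    """Return True if pip could install an archive with this filename."""
    return filename.endswith(SUPPORTED_EXTENSIONS)

links = [
    "mesos.interface-1.3.0.tar.gz",          # sdist: installable
    "mesos.interface-1.4.0-py2.7.egg",       # egg: skipped by pip
]
candidates = [f for f in links if installable(f)]
```

Re-uploading an sdist (or wheel) for 1.4.0, as the update above notes, is what makes the release installable again.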
[jira] [Commented] (MESOS-8078) Some fields went missing with no replacement in api/v1
[ https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319344#comment-16319344 ] Greg Mann commented on MESOS-8078: -- [~drozhkov], thanks for the ping. I'm working on this issue today, hoping to post a patch by EOD. > Some fields went missing with no replacement in api/v1 > -- > > Key: MESOS-8078 > URL: https://issues.apache.org/jira/browse/MESOS-8078 > Project: Mesos > Issue Type: Story > Components: HTTP API >Reporter: Dmitrii Rozhkov >Assignee: Greg Mann >Priority: Critical > Labels: mesosphere > > Hi friends, > These fields are available via state.json but went missing in v1 of > the API: > -leader_info- -> available via GET_MASTER, which should always return leading > master info > start_time > elected_time > As we're showing them on the Overview page of the DC/OS UI and would like > to stop using state.json, it would be great to have them somewhere in v1.
[jira] [Updated] (MESOS-8078) Some fields went missing with no replacement in api/v1
[ https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-8078: - Shepherd: Vinod Kone (was: Greg Mann)
[jira] [Assigned] (MESOS-8419) RP manager incorrectly setting framework ID leads to CHECK failure
[ https://issues.apache.org/jira/browse/MESOS-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-8419: Assignee: Greg Mann > RP manager incorrectly setting framework ID leads to CHECK failure > -- > > Key: MESOS-8419 > URL: https://issues.apache.org/jira/browse/MESOS-8419 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Blocker > Labels: mesosphere > > The resource provider manager [unconditionally sets the framework > ID|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/manager.cpp#L637] > when forwarding operation status updates to the agent. This is incorrect, > for example, when the resource provider [generates OPERATION_DROPPED updates > during > reconciliation|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/storage/provider.cpp#L1653-L1657], > and leads to protobuf errors in this case since the framework ID's required > {{value}} field is left unset.
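The bug description above suggests the fix direction: copy the framework ID into the forwarded update only when the operation is actually tied to a framework. A hedged Python sketch of that guard (dicts standing in for the protobuf messages; `forward_update` is a hypothetical name, not Mesos source):

```python
from typing import Optional

# Sketch of the fix direction for the RP manager: assign the framework ID
# only when one exists. Unconditionally assigning it would leave the
# required `value` field unset for framework-less updates, such as the
# OPERATION_DROPPED updates generated during reconciliation.
def forward_update(update: dict, framework_id: Optional[str]) -> dict:
    message = {"status": update}
    if framework_id is not None:  # guard instead of assigning blindly
        message["framework_id"] = {"value": framework_id}
    return message
```

With this guard, a reconciliation-generated update simply carries no framework ID rather than an invalid one.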
[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used
[ https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-8356: -- Fix Version/s: 1.3.2 > Persistent volume ownership is set to root despite of sandbox owner > (frameworkInfo.user) when docker executor is used > - > > Key: MESOS-8356 > URL: https://issues.apache.org/jira/browse/MESOS-8356 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1 > Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13 >Reporter: Konstantin Kalin >Assignee: Jie Yu >Priority: Critical > Labels: persistent-volumes > Fix For: 1.3.2, 1.4.2, 1.5.0 > > > PersistentVolume ownership is not set to match the sandbox user when the > docker executor is used. It looks like the issue was introduced by > https://reviews.apache.org/r/45963/ > I didn't check the universal containerizer yet. > As far as I understand, the following code is supposed to check that a volume > is not already being used by other tasks/containers. > src/slave/containerizer/docker.cpp > {code:java} > foreachvalue (const Container* container, containers_) { > if (container->resources.contains(resource)) { > isVolumeInUse = true; > break; > } > } > {code} > But it doesn't exclude the container being launched (in my case I have only one > container, not a task group). Thus the ownership of the PersistentVolume stays > "root" (I run mesos-agent under root) and it's impossible to use the volume > inside the container. We always run processes inside Docker containers under an > unprivileged user. > A small patch excluding the container being launched fixes the issue. > {code:java} > foreachvalue (const Container* container, containers_) { > if (container->resources.contains(resource) && > containerId != container->id) { > isVolumeInUse = true; > break; > } > } > {code}
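The quoted patch can be restated compactly: a volume counts as "in use" only if some container *other than* the one being launched holds it. A Python rendition of the same check (illustrative only, not Mesos source):

```python
# Python rendition of the quoted C++ fix. Without the `cid != container_id`
# exclusion, the container being launched always matches its own resources,
# the check always trips, and the volume's ownership is never changed from
# root to the sandbox user.
def is_volume_in_use(containers: dict, resource: str, container_id: str) -> bool:
    """containers maps container id -> set of resources it holds."""
    for cid, resources in containers.items():
        if resource in resources and cid != container_id:
            return True
    return False
```

With a single container (no task group), the unpatched check would always return True, which is exactly the reported symptom.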
[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used
[ https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-8356: -- Target Version/s: 1.3.2, 1.4.2, 1.5.1 (was: 1.4.2, 1.5.1)
[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used
[ https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-8356: -- Fix Version/s: 1.4.2
[jira] [Commented] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used
[ https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319113#comment-16319113 ] Jie Yu commented on MESOS-8356: --- commit c8e6487d251d938c3c221f606f7e924514877655 (origin/master, origin/HEAD, master) Author: Jie Yu Date: Tue Jan 9 11:23:20 2018 -0800 Fixed the persistent volume permission issue in DockerContainerizer. This patch fixes MESOS-8356 by skipping the current container to be launched when doing the shared volume check (`isVolumeInUse`). Prior to this patch, the code is buggy because `isVolumeInUse` will always be set to `true`. Review: https://reviews.apache.org/r/65049
[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used
[ https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-8356: -- Fix Version/s: 1.5.0
[jira] [Updated] (MESOS-8078) Some fields went missing with no replacement in api/v1
[ https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-8078: - Sprint: Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 68, Mesosphere Sprint 72 (was: Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 68)
[jira] [Assigned] (MESOS-8078) Some fields went missing with no replacement in api/v1
[ https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-8078: Assignee: Greg Mann (was: Vinod Kone)
[jira] [Commented] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used
[ https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318943#comment-16318943 ] Jie Yu commented on MESOS-8356: --- I verified that it's not an issue with Mesos containerizer (aka, universal containerizer), but it's a problem for docker containerizer.
[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used
[ https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-8356: -- Affects Version/s: 1.1.3, 1.2.3, 1.3.1 Target Version/s: 1.4.2, 1.5.1
[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used
[ https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-8356: -- Priority: Critical (was: Major)
[jira] [Comment Edited] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used
[ https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318907#comment-16318907 ] Jie Yu edited comment on MESOS-8356 at 1/9/18 6:45 PM: --- [~kkalin] Thanks for reporting! [~xujyan] This looks like a bug to me because `current` is always set to empty in Docker containerizer: https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/docker.cpp#L625 The logic in the Mesos containerizer (i.e., filesystem/linux isolator) is slightly different as `current` there is set to be `info->resources`, thus not buggy https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L644 was (Author: jieyu): [~kkalin] Thanks for reporting! [~xujyan] This looks like a bug to me because `current` is always set to empty in Docker containerizer: https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/docker.cpp#L625 The logic in the Mesos containerizer (i.e., filesystem/linux isolator) is slightly different as `current` there is set to be `info->resources`: https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L644 > Persistent volume ownership is set to root despite of sandbox owner > (frameworkInfo.user) when docker executor is used > - > > Key: MESOS-8356 > URL: https://issues.apache.org/jira/browse/MESOS-8356 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.3, 1.2.3, 1.3.1, 1.4.1 > Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13 >Reporter: Konstantin Kalin >Assignee: Jie Yu >Priority: Critical > Labels: persistent-volumes > > PersistentVolume ownership is not set to match the sandbox user when the > docker executor is used. Looks like the issue was introduced by > https://reviews.apache.org/r/45963/ > I didn't check the universal containerizer yet. > As far as I understand the following code is supposed to check that a volume > is not being already used by other tasks/containers. 
> src/slave/containerizer/docker.cpp > {code:java} > foreachvalue (const Container* container, containers_) { > if (container->resources.contains(resource)) { > isVolumeInUse = true; > break; > } > } > {code} > But it doesn't exclude a container to be launch (In my case I have only one > container - no group of tasks). Thus the ownership of PersistentVolume stays > "root" (I run mesos-agent under root) and it's impossible to use the volume > inside the container. We always run processes inside Docker containers under > unprivileged user. > Making a small patch to exclude the container to launch fixes the issue. > {code:java} > foreachvalue (const Container* container, containers_) { > if (container->resources.contains(resource) && > containerId != container->id) { > isVolumeInUse = true; > break; > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8356) Persistent volume ownership is set to root despite of sandbox owner (frameworkInfo.user) when docker executor is used
[ https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318907#comment-16318907 ] Jie Yu commented on MESOS-8356: --- [~kkalin] Thanks for reporting! [~xujyan] This looks like a bug to me because `current` is always set to empty in Docker containerizer: https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/docker.cpp#L625 The logic in the Mesos containerizer (i.e., filesystem/linux isolator) is slightly different as `current` there is set to be `info->resources`: https://github.com/apache/mesos/blob/1.4.x/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L644 > Persistent volume ownership is set to root despite of sandbox owner > (frameworkInfo.user) when docker executor is used > - > > Key: MESOS-8356 > URL: https://issues.apache.org/jira/browse/MESOS-8356 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.4.1 > Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13 >Reporter: Konstantin Kalin >Assignee: Jie Yu > Labels: persistent-volumes > > PersistentVolume ownership is not set to match the sandbox user when the > docker executor is used. Looks like the issue was introduced by > https://reviews.apache.org/r/45963/ > I didn't check the universal containerizer yet. > As far as I understand the following code is supposed to check that a volume > is not being already used by other tasks/containers. > src/slave/containerizer/docker.cpp > {code:java} > foreachvalue (const Container* container, containers_) { > if (container->resources.contains(resource)) { > isVolumeInUse = true; > break; > } > } > {code} > But it doesn't exclude a container to be launch (In my case I have only one > container - no group of tasks). Thus the ownership of PersistentVolume stays > "root" (I run mesos-agent under root) and it's impossible to use the volume > inside the container. We always run processes inside Docker containers under > unprivileged user. 
> Making a small patch to exclude the container to launch fixes the issue. > {code:java} > foreachvalue (const Container* container, containers_) { > if (container->resources.contains(resource) && > containerId != container->id) { > isVolumeInUse = true; > break; > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-8125) Agent should properly handle recovering an executor when its pid is reused
[ https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Megha Sharma reassigned MESOS-8125: --- Assignee: Megha Sharma > Agent should properly handle recovering an executor when its pid is reused > -- > > Key: MESOS-8125 > URL: https://issues.apache.org/jira/browse/MESOS-8125 > Project: Mesos > Issue Type: Bug >Reporter: Gastón Kleiman >Assignee: Megha Sharma >Priority: Critical > > We know that all executors will be gone once the host on which an agent is > running is rebooted, so there's no need to try to recover these executors. > Trying to recover stopped executors can lead to problems if another process > is assigned the same pid that the executor had before the reboot. In this > case the agent will unsuccessfully try to reregister with the executor, and > then transition it to a {{TERMINATING}} state. The executor will sadly get > stuck in that state, and the tasks that it started will get stuck in whatever > state they were in at the time of the reboot. > One way of getting rid of stuck executors is to remove the {{latest}} symlink > under {{work_dir/meta/slaves/latest/frameworks/<framework-id>/executors/<executor-id>/runs}}. > Here's how to reproduce this issue: > # Start a task using the Docker containerizer (the same will probably happen > with the command executor). > # Stop the corresponding Mesos agent while the task is running. > # Change the executor's checkpointed forked pid, which is located in the meta > directory, e.g., > {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}. > I used pid 2, which is normally used by {{kthreadd}}. > # Reboot the host
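One way to make recovery robust against this pid reuse, sketched here under the assumption that the agent checkpoints the host's boot id alongside {{forked.pid}} (a hypothetical scheme, not current Mesos behavior): short-circuit the executor reconnect whenever the host has rebooted since the pid was written.

```python
# Illustrative sketch, not Mesos code: a checkpointed pid is only
# trustworthy if the host has not rebooted since it was written. After a
# reboot, the same pid may belong to an unrelated process (e.g. kthreadd,
# as in the repro above), so recovery should skip the reconnect attempt.
def should_reconnect(checkpointed_boot_id: str, current_boot_id: str) -> bool:
    """Attempt to reregister with a checkpointed executor pid only when
    the host has not rebooted since the pid was checkpointed."""
    return checkpointed_boot_id == current_boot_id
```

On Linux, the current boot id can be read from /proc/sys/kernel/random/boot_id; it changes on every reboot, so a mismatch with the checkpointed value signals that all pre-reboot pids are stale.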
[jira] [Commented] (MESOS-8348) Enable function sections in the build.
[ https://issues.apache.org/jira/browse/MESOS-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318864#comment-16318864 ] James Peach commented on MESOS-8348: No apparent performance difference with a quick and arbitrary benchmark. *Without GC unused sections:* {noformat} [--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test [ RUN ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0 Starting reregistration for all agents Reregistered 2000 agents with a total of 10 running tasks and 10 completed tasks in 28.812622779secs [ OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0 (60329 ms) [ RUN ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1 Starting reregistration for all agents Reregistered 2000 agents with a total of 20 running tasks and 0 completed tasks in 39.378296252secs [ OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1 (98509 ms) [ RUN ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2 Starting reregistration for all agents Reregistered 2 agents with a total of 10 running tasks and 0 completed tasks in 45.240454686secs [ OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2 (80371 ms) [--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test (239209 ms total) {noformat} *With GC unused sections:* {noformat} [--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test [ RUN ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0 Starting reregistration for all agents Reregistered 2000 agents with a total of 10 running tasks and 10 completed tasks in 28.751620417secs [ OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0 (59282 ms) [ RUN ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1 Starting 
reregistration for all agents Reregistered 2000 agents with a total of 20 running tasks and 0 completed tasks in 40.010202034secs [ OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1 (96938 ms) [ RUN ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2 Starting reregistration for all agents Reregistered 2 agents with a total of 10 running tasks and 0 completed tasks in 44.541095336secs [ OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2 (79331 ms) [--] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test (235551 ms total) {noformat} > Enable function sections in the build. > -- > > Key: MESOS-8348 > URL: https://issues.apache.org/jira/browse/MESOS-8348 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: James Peach >Assignee: James Peach > > Enable {{-ffunction-sections}} to improve the ability of the toolchain to > remove unused code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite sandbox owner (frameworkInfo.user) when docker executor is used
[ https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-8356: -- Target Version/s: (was: 1.4.2, 1.5.1) > Persistent volume ownership is set to root despite sandbox owner > (frameworkInfo.user) when docker executor is used > - > > Key: MESOS-8356 > URL: https://issues.apache.org/jira/browse/MESOS-8356 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.4.1 > Environment: Centos 7, Mesos 1.4.1, Docker Engine 1.13 >Reporter: Konstantin Kalin >Assignee: Jie Yu > Labels: persistent-volumes > > PersistentVolume ownership is not set to match the sandbox user when the > docker executor is used. It looks like the issue was introduced by > https://reviews.apache.org/r/45963/ > I didn't check the universal containerizer yet. > As far as I understand, the following code is supposed to check that a volume > is not already being used by other tasks/containers. > src/slave/containerizer/docker.cpp > {code:java} > foreachvalue (const Container* container, containers_) { > if (container->resources.contains(resource)) { > isVolumeInUse = true; > break; > } > } > {code} > But it doesn't exclude the container being launched (in my case I have only one > container, not a group of tasks). Thus the ownership of the PersistentVolume stays > "root" (I run mesos-agent under root) and it's impossible to use the volume > inside the container. We always run processes inside Docker containers under an > unprivileged user. > A small patch that excludes the container being launched fixes the issue. > {code:java} > foreachvalue (const Container* container, containers_) { > if (container->resources.contains(resource) && > containerId != container->id) { > isVolumeInUse = true; > break; > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite sandbox owner (frameworkInfo.user) when docker executor is used
[ https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-8356: -- Target Version/s: 1.4.2, 1.5.1 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8356) Persistent volume ownership is set to root despite sandbox owner (frameworkInfo.user) when docker executor is used
[ https://issues.apache.org/jira/browse/MESOS-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-8356: -- Affects Version/s: 1.4.1 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-6595) As a Mesos user I want to launch processes that will run on every node in the cluster
[ https://issues.apache.org/jira/browse/MESOS-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318802#comment-16318802 ] Sampsa Tuokko commented on MESOS-6595: -- Any progress at all on this? DaemonSets are a heavily used feature in the Kubernetes world; this would be immensely useful from an operations perspective. > As a Mesos user I want to launch processes that will run on every node in the > cluster > - > > Key: MESOS-6595 > URL: https://issues.apache.org/jira/browse/MESOS-6595 > Project: Mesos > Issue Type: Story >Reporter: James DeFelice > Labels: mesosphere > > Some applicable use cases: > - log collection > - metrics and monitoring > - service discovery > It might also be useful to break this functionality down into: daemon > processes for master nodes vs. daemon processes for agent nodes. > There was some initial discussion and back-of-the-napkin design for this at > Mesoscon this past year (with an emphasis on agent nodes) but I'm not aware > that anything significant materialized from that. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7854) Authorize resource calls to provider manager api
[ https://issues.apache.org/jira/browse/MESOS-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-7854: Description: The resource provider manager provides a function {code} process::Future<http::Response> api( const process::http::Request& request, const Option<Principal>& principal) const; {code} which is exposed e.g., as an agent endpoint. We need to add authorization to this function in order to e.g., stop rogue callers. was: The resource provider manager provides a function {code} process::Future<http::Response> api( const process::http::Request& request, const Option<Principal>& principal) const; {code} which is expose e.g., as an agent endpoint. We need to add authorization to this function in order to e.g., stop rogue callers. > Authorize resource calls to provider manager api > > > Key: MESOS-7854 > URL: https://issues.apache.org/jira/browse/MESOS-7854 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Bannier >Priority: Critical > Labels: csi-post-mvp, mesosphere, storage > > The resource provider manager provides a function > {code} > process::Future<http::Response> api( > const process::http::Request& request, > const Option<Principal>& principal) const; > {code} > which is exposed e.g., as an agent endpoint. > We need to add authorization to this function in order to e.g., stop rogue > callers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7558) Add resource provider validation
[ https://issues.apache.org/jira/browse/MESOS-7558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-7558: Story Points: 2 > Add resource provider validation > > > Key: MESOS-7558 > URL: https://issues.apache.org/jira/browse/MESOS-7558 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jan Schlicht > Labels: mesosphere, storage > > Similar to how it's done during agent registration/re-registration, the > information provided by a resource provider needs to be validated during > certain operations (e.g. re-registration, while applying offer operations, > ...). > Some of these validations only cover the provided information (e.g. are the > resources in {{ResourceProviderInfo}} only of type {{disk}}), while others take the > current cluster state into account (e.g. do the resources that a task wants > to use exist on the resource provider). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7558) Add resource provider validation
[ https://issues.apache.org/jira/browse/MESOS-7558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-7558: Story Points: 3 (was: 2) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7329) Authorize offer operations for converting disk resources
[ https://issues.apache.org/jira/browse/MESOS-7329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-7329: Story Points: 3 > Authorize offer operations for converting disk resources > > > Key: MESOS-7329 > URL: https://issues.apache.org/jira/browse/MESOS-7329 > Project: Mesos > Issue Type: Task > Components: master, security >Reporter: Jan Schlicht > Labels: csi-post-mvp, mesosphere, security, storage > > All offer operations are authorized, hence authorization logic has to be > added to new offer operations as well. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
[ https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318666#comment-16318666 ] Andrei Budnik commented on MESOS-7742: -- Since we launch the [`cat`|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/tests/api_tests.cpp#L6529] command as a nested container, the related ioswitchboard process will be in the same process group. Whenever the process group leader ({{cat}}) terminates, all processes in the process group are killed, including ioswitchboard. ioswitchboard handles HTTP requests from the slave, e.g. the {{ATTACH_CONTAINER_INPUT}} request in this test. Usually, after reading all of the client's data, {{Http::_attachContainerInput()}} invokes a callback which calls [writer.close()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/http.cpp#L3223]. [writer.close()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L561] implies sending a [\r\n\r\n|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1045] to the ioswitchboard process. ioswitchboard returns a [200 OK|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/containerizer/mesos/io/switchboard.cpp#L1572] response, hence the agent returns {{200 OK}} for the {{ATTACH_CONTAINER_INPUT}} request as expected. 
However, if ioswitchboard terminates before it receives the {{\r\n\r\n}}, or before the agent receives the {{200 OK}} response from ioswitchboard, the connection (via unix socket) might be closed, so the corresponding {{ConnectionProcess}} will handle this case as an unexpected [EOF|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1293] during the [read|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1216] of a response. That will lead to a {{500 Internal Server Error}} response from the agent. > ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky > -- > > Key: MESOS-7742 > URL: https://issues.apache.org/jira/browse/MESOS-7742 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone >Assignee: Andrei Budnik > Labels: flaky-test, mesosphere-oncall > Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, > LaunchNestedContainerSessionDisconnected-badrun.txt > > > Observed this on ASF CI and internal Mesosphere CI. Affected tests: > {noformat} > AgentAPIStreamingTest.AttachInputToNestedContainerSession > AgentAPITest.LaunchNestedContainerSession > AgentAPITest.AttachContainerInputAuthorization/0 > AgentAPITest.LaunchNestedContainerSessionWithTTY/0 > AgentAPITest.LaunchNestedContainerSessionDisconnected/1 > {noformat} > This issue comes at least in three different flavours. Take > {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example. > h5. Flavour 1 > {noformat} > ../../src/tests/api_tests.cpp:6473 > Value of: (response).get().status > Actual: "503 Service Unavailable" > Expected: http::OK().status > Which is: "200 OK" > Body: "" > {noformat} > h5. 
Flavour 2 > {noformat} > ../../src/tests/api_tests.cpp:6473 > Value of: (response).get().status > Actual: "500 Internal Server Error" > Expected: http::OK().status > Which is: "200 OK" > Body: "Disconnected" > {noformat} > h5. Flavour 3 > {noformat} > /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367 > Value of: (sessionResponse).get().status > Actual: "500 Internal Server Error" > Expected: http::OK().status > Which is: "200 OK" > Body: "" > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
[ https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318567#comment-16318567 ] Andrei Budnik commented on MESOS-7742: -- How to reproduce Flavour 3: Put a {{::sleep(1);}} before {{writer.close();}} in [Http::_attachContainerInput()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/http.cpp#L3222]. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8388) Show LRP resources in master endpoints.
[ https://issues.apache.org/jira/browse/MESOS-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-8388: Story Points: 2 Component/s: master > Show LRP resources in master endpoints. > --- > > Key: MESOS-8388 > URL: https://issues.apache.org/jira/browse/MESOS-8388 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jie Yu > > Currently, only the resource provider info is shown. We should also show the > resources provided by the RP. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8078) Some fields went missing with no replacement in api/v1
[ https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318459#comment-16318459 ] Dmitrii Rozhkov commented on MESOS-8078: Hey [~vinodkone], would you please give an update on the issue? It's kind of critical since we're approaching release. Thanks! > Some fields went missing with no replacement in api/v1 > -- > > Key: MESOS-8078 > URL: https://issues.apache.org/jira/browse/MESOS-8078 > Project: Mesos > Issue Type: Story > Components: HTTP API >Reporter: Dmitrii Rozhkov >Assignee: Vinod Kone >Priority: Critical > Labels: mesosphere > > Hi friends, > These fields are available via state.json but went missing in v1 of > the API: > -leader_info- -> available via GET_MASTER, which should always return leading > master info > start_time > elected_time > As we're showing them on the Overview page of the DC/OS UI, yet would like > to stop using state.json, it would be great to have them somewhere in V1. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8078) Some fields went missing with no replacement in api/v1
[ https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitrii Rozhkov updated MESOS-8078: --- Priority: Critical (was: Major) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8382) Master should bookkeep local resource providers.
[ https://issues.apache.org/jira/browse/MESOS-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-8382: Story Points: 5 > Master should bookkeep local resource providers. > > > Key: MESOS-8382 > URL: https://issues.apache.org/jira/browse/MESOS-8382 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Benjamin Bannier > Original Estimate: 5m > Remaining Estimate: 5m > > This will simplify the handling of `UpdateSlaveMessage`. Also, it will simplify > the endpoint serving. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7506) Multiple tests leave orphan containers.
[ https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik updated MESOS-7506: - Attachment: ROOT_IsolatorFlags-badrun2.txt > Multiple tests leave orphan containers. > --- > > Key: MESOS-7506 > URL: https://issues.apache.org/jira/browse/MESOS-7506 > Project: Mesos > Issue Type: Bug > Components: containerization > Environment: Ubuntu 16.04 > Fedora 23 > other Linux distros >Reporter: Alexander Rukletsov >Assignee: Andrei Budnik > Labels: containerizer, flaky-test, mesosphere > Attachments: KillMultipleTasks-badrun.txt, > ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, > ReconcileTasksMissingFromSlave-badrun.txt, ResourceLimitation-badrun.txt, > ResourceLimitation-badrun2.txt, > RestartSlaveRequireExecutorAuthentication-badrun.txt, > TaskWithFileURI-badrun.txt > > > I've observed a number of flaky tests that leave orphan containers upon > cleanup. A typical log looks like this: > {noformat} > ../../src/tests/cluster.cpp:580: Failure > Value of: containers->empty() > Actual: false > Expected: true > Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 } > {noformat} > All currently affected tests: > {noformat} > SlaveTest.RestartSlaveRequireExecutorAuthentication > LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8419) RP manager incorrectly setting framework ID leads to CHECK failure
[ https://issues.apache.org/jira/browse/MESOS-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318132#comment-16318132 ] Greg Mann commented on MESOS-8419: -- Review here: https://reviews.apache.org/r/65034/ > RP manager incorrectly setting framework ID leads to CHECK failure > -- > > Key: MESOS-8419 > URL: https://issues.apache.org/jira/browse/MESOS-8419 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Greg Mann >Priority: Blocker > Labels: mesosphere > > The resource provider manager [unconditionally sets the framework > ID|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/manager.cpp#L637] > when forwarding operation status updates to the agent. This is incorrect, > for example, when the resource provider [generates OPERATION_DROPPED updates > during > reconciliation|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/storage/provider.cpp#L1653-L1657], > and leads to protobuf errors in this case since the framework ID's required > {{value}} field is left unset. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8419) RP manager incorrectly setting framework ID leads to CHECK failure
[ https://issues.apache.org/jira/browse/MESOS-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-8419: - Sprint: Mesosphere Sprint 72 Story Points: 1 Labels: mesosphere (was: ) Priority: Blocker (was: Major) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8419) RP manager incorrectly setting framework ID leads to CHECK failure
Greg Mann created MESOS-8419: Summary: RP manager incorrectly setting framework ID leads to CHECK failure Key: MESOS-8419 URL: https://issues.apache.org/jira/browse/MESOS-8419 Project: Mesos Issue Type: Bug Components: agent Reporter: Greg Mann The resource provider manager [unconditionally sets the framework ID|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/manager.cpp#L637] when forwarding operation status updates to the agent. This is incorrect, for example, when the resource provider [generates OPERATION_DROPPED updates during reconciliation|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/resource_provider/storage/provider.cpp#L1653-L1657], and leads to protobuf errors in this case since the framework ID's required {{value}} field is left unset. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8418) mesos-agent high cpu usage because of numerous /proc/mounts reads
Stéphane Cottin created MESOS-8418: -- Summary: mesos-agent high cpu usage because of numerous /proc/mounts reads Key: MESOS-8418 URL: https://issues.apache.org/jira/browse/MESOS-8418 Project: Mesos Issue Type: Improvement Components: agent, cgroups Reporter: Stéphane Cottin /proc/mounts is read many, many times from src/(linux/fs|linux/cgroups|slave/slave).cpp. When using overlayfs, the /proc/mounts contents can become quite large. As an example, one of our Q/A single nodes, running ~150 tasks, has a 361-line, 201299-character /proc/mounts file. This 200kB file is read on this node about 25 to 150 times per second. This is a (huge) waste of CPU and I/O time. Most of these calls are related to cgroups. Please consider these proposals: 1/ Is /proc/mounts mandatory for cgroups? We already have the cgroup subsystem list from /proc/cgroups. The only compelling information from /proc/mounts seems to be the root mount point, /sys/fs/cgroup/, which could be obtained by a single read on agent start. 
2/ use /proc/self/mountstats {noformat} wc /proc/self/mounts /proc/self/mountstats 361 2166 201299 /proc/self/mounts 361 2888 50200 /proc/self/mountstats {noformat} {noformat} grep cgroup /proc/self/mounts cgroup /sys/fs/cgroup tmpfs rw,relatime,mode=755 0 0 cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0 cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0 cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0 cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0 cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0 cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices 0 0 cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer 0 0 cgroup /sys/fs/cgroup/net_cls cgroup rw,relatime,net_cls 0 0 cgroup /sys/fs/cgroup/perf_event cgroup rw,relatime,perf_event 0 0 cgroup /sys/fs/cgroup/net_prio cgroup rw,relatime,net_prio 0 0 cgroup /sys/fs/cgroup/pids cgroup rw,relatime,pids 0 0 {noformat} {noformat} grep cgroup /proc/self/mountstats device cgroup mounted on /sys/fs/cgroup with fstype tmpfs device cgroup mounted on /sys/fs/cgroup/cpuset with fstype cgroup device cgroup mounted on /sys/fs/cgroup/cpu with fstype cgroup device cgroup mounted on /sys/fs/cgroup/cpuacct with fstype cgroup device cgroup mounted on /sys/fs/cgroup/blkio with fstype cgroup device cgroup mounted on /sys/fs/cgroup/memory with fstype cgroup device cgroup mounted on /sys/fs/cgroup/devices with fstype cgroup device cgroup mounted on /sys/fs/cgroup/freezer with fstype cgroup device cgroup mounted on /sys/fs/cgroup/net_cls with fstype cgroup device cgroup mounted on /sys/fs/cgroup/perf_event with fstype cgroup device cgroup mounted on /sys/fs/cgroup/net_prio with fstype cgroup device cgroup mounted on /sys/fs/cgroup/pids with fstype cgroup {noformat} This file contains all the required information, and is 4x smaller. 3/ microcaching Caching cgroups data for just 1 second would be a huge performance improvement, but I'm not aware of the possible side effects. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8373) Test reconciliation after operation is dropped en route to agent
[ https://issues.apache.org/jira/browse/MESOS-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318110#comment-16318110 ] Greg Mann commented on MESOS-8373: -- Review here: https://reviews.apache.org/r/65039/ > Test reconciliation after operation is dropped en route to agent > > > Key: MESOS-8373 > URL: https://issues.apache.org/jira/browse/MESOS-8373 > Project: Mesos > Issue Type: Task > Components: agent, master >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere > > Since new code paths were added to handle operations on resources in 1.5, we > should test that such operations are reconciled correctly after an operation > is dropped on the way from the master to the agent. -- This message was sent by Atlassian JIRA (v6.4.14#64029)