[jira] [Commented] (MESOS-9963) URI stringification constructs malformed URIs.
[ https://issues.apache.org/jira/browse/MESOS-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933051#comment-16933051 ] James Peach commented on MESOS-9963: Verified that this issue doesn't cause any problems in the current code, because callers are careful to ensure the path component begins with '/'. > URI stringification constructs malformed URIs. > -- > > Key: MESOS-9963 > URL: https://issues.apache.org/jira/browse/MESOS-9963 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > Labels: containerization > > Setting {{docker_registry="https://docker-cache.example.com/"}} and then > pulling an image named {{org/image-name:latest}} fails. The Docker image > puller ends up constructing a malformed URL for the manifest: > {noformat} > Pulling image 'org/siri-centos6:stage' from > 'docker-manifest://docker-cache.example.com:443org/image-name?latest#https' > to '/tmp/mesos/store/docker/staging/LGArHA' > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
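The malformed manifest URL quoted above can be reproduced with a small sketch. This is Python, not the Mesos C++ code, and `stringify_uri` with its `normalize_path` parameter is a hypothetical helper rather than the Mesos URI API; it only illustrates how a path without a leading '/' ends up glued onto the port, and how normalizing the path avoids that.

```python
def stringify_uri(scheme, host, port=None, path="", query=None, fragment=None,
                  normalize_path=True):
    """Join URI components into a string (hypothetical helper, not Mesos code)."""
    result = f"{scheme}://{host}"
    if port is not None:
        result += f":{port}"
    if path:
        # A relative path would otherwise be appended directly after the port.
        if normalize_path and not path.startswith("/"):
            path = "/" + path
        result += path
    if query is not None:
        result += f"?{query}"
    if fragment is not None:
        result += f"#{fragment}"
    return result


# Same components as in the log line from the issue description.
malformed = stringify_uri("docker-manifest", "docker-cache.example.com", 443,
                          "org/image-name", "latest", "https",
                          normalize_path=False)
well_formed = stringify_uri("docker-manifest", "docker-cache.example.com", 443,
                            "org/image-name", "latest", "https")
```

Whether the normalization belongs in the stringification routine or in its callers is a design choice; the comment above notes that current callers already supply a leading '/'.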
[jira] [Created] (MESOS-9963) Docker puller malforms registry URLs.
James Peach created MESOS-9963: -- Summary: Docker puller malforms registry URLs. Key: MESOS-9963 URL: https://issues.apache.org/jira/browse/MESOS-9963 Project: Mesos Issue Type: Improvement Components: containerization Reporter: James Peach Assignee: James Peach Setting {{docker_registry="https://docker-cache.example.com/"}} and then pulling an image named {{org/image-name:latest}} fails. The Docker image puller ends up constructing a malformed URL for the manifest: {noformat} Pulling image 'org/siri-centos6:stage' from 'docker-manifest://docker-cache.example.com:443org/image-name?latest#https' to '/tmp/mesos/store/docker/staging/LGArHA' {noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (MESOS-4741) Add role information for static reservation in /master/roles
[ https://issues.apache.org/jira/browse/MESOS-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918205#comment-16918205 ] James Peach commented on MESOS-4741: MESOS-9888 is a duplicate. > Add role information for static reservation in /master/roles > > > Key: MESOS-4741 > URL: https://issues.apache.org/jira/browse/MESOS-4741 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Reporter: Klaus Ma >Priority: Major > > In {{/master/roles}}, it should show static reservation roles if there's no > tasks. > {code} > Klauss-MacBook-Pro:mesos klaus$ curl http://localhost:5050/master/roles.json > | python -m json.tool > % Total% Received % Xferd Average Speed TimeTime Time > Current > Dload Upload Total SpentLeft Speed > 10093 100930 0 13907 0 --:--:-- --:--:-- --:--:-- 15500 > { > "roles": [ > { > "frameworks": [], > "name": "*", > "resources": { > "cpus": 0, > "disk": 0, > "mem": 0 > }, > "weight": 1.0 > } > ] > } > {code} > After submit tasks to r1, it'll show roles. > {code} > Klauss-MacBook-Pro:mesos klaus$ curl http://localhost:5050/master/roles | > python -m json.tool > % Total% Received % Xferd Average Speed TimeTime Time > Current > Dload Upload Total SpentLeft Speed > 100 221 100 2210 0 32721 0 --:--:-- --:--:-- --:--:-- 36833 > { > "roles": [ > { > "frameworks": [], > "name": "*", > "resources": { > "cpus": 0, > "disk": 0, > "mem": 0 > }, > "weight": 1.0 > }, > { > "frameworks": [ > "b4f15a2e-5d9a-4d31-a29e-7737af41c8e4-0002" > ], > "name": "r1", > "resources": { > "cpus": 1.0, > "disk": 0, > "mem": 0 > }, > "weight": 1.0 > } > ] > } > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (MESOS-9935) The agent crashes after the disk du isolator supporting rootfs checks.
[ https://issues.apache.org/jira/browse/MESOS-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905794#comment-16905794 ] James Peach commented on MESOS-9935: This reproduces if you run a task without any disk resource. > The agent crashes after the disk du isolator supporting rootfs checks. > -- > > Key: MESOS-9935 > URL: https://issues.apache.org/jira/browse/MESOS-9935 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Gilbert Song >Assignee: James Peach >Priority: Blocker > > This issue was broken by this patch: > https://github.com/apache/mesos/commit/8ba0682521c6051b42f33b3dd96a37f4d46a290d#diff-33089e53bdf9f646cdb9317c212eda02 > A task can be launched without disk resource. However, after this patch, if > the disk resource does not exist, the agent crashes - because the info->paths > only add an entry 'path' when there is a quota and the quota comes from the > disk resource. > {noformat} > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: F0809 14:54:00.017730 15498 process.cpp:3057] Aborting > libprocess: 'posix-disk-isolator(1)@172.12.2.196:5051' threw exception: > _Map_base::at > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: *** Check failure stack trace: *** > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f7d585cd google::LogMessage::Fail() > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f7d5a828 google::LogMessage::SendToLog() > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f7d58163 google::LogMessage::Flush() > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f7d5b169 > google::LogMessageFatal::~LogMessageFatal() > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f7cb8dbd process::ProcessManager::resume() > Aug 09 14:54:00 
ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f7cbe926 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f3976070 (unknown) > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f3194e25 start_thread > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f2ebebad __clone > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
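The crash described above comes from looking up a path entry that was never recorded because the task had no disk resource. A minimal Python analogue follows; `ContainerInfo`, `record_quota` and `quota_for` are hypothetical names, not the isolator's real types, and this is not the actual fix that was committed.

```python
class ContainerInfo:
    """Hypothetical stand-in for the isolator's per-container state."""

    def __init__(self):
        self.paths = {}  # path -> quota in bytes


def record_quota(info, path, disk_quota):
    # An entry is only added when the task actually carries a disk quota.
    if disk_quota is not None:
        info.paths[path] = disk_quota


def quota_for(info, path):
    # Tolerant lookup: returns None instead of raising when no entry exists.
    return info.paths.get(path)


info = ContainerInfo()
record_quota(info, "/sandbox", None)  # task launched without a disk resource

try:
    info.paths["/sandbox"]  # unchecked lookup, analogous to map::at
    unchecked_lookup_failed = False
except KeyError:
    unchecked_lookup_failed = True

checked_lookup = quota_for(info, "/sandbox")
```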
[jira] [Assigned] (MESOS-9935) The agent crashes after the disk du isolator supporting rootfs checks.
[ https://issues.apache.org/jira/browse/MESOS-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-9935: -- Assignee: James Peach > The agent crashes after the disk du isolator supporting rootfs checks. > -- > > Key: MESOS-9935 > URL: https://issues.apache.org/jira/browse/MESOS-9935 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Gilbert Song >Assignee: James Peach >Priority: Blocker > > This issue was broken by this patch: > https://github.com/apache/mesos/commit/8ba0682521c6051b42f33b3dd96a37f4d46a290d#diff-33089e53bdf9f646cdb9317c212eda02 > A task can be launched without disk resource. However, after this patch, if > the disk resource does not exist, the agent crashes - because the info->paths > only add an entry 'path' when there is a quota and the quota comes from the > disk resource. > {noformat} > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: F0809 14:54:00.017730 15498 process.cpp:3057] Aborting > libprocess: 'posix-disk-isolator(1)@172.12.2.196:5051' threw exception: > _Map_base::at > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: *** Check failure stack trace: *** > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f7d585cd google::LogMessage::Fail() > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f7d5a828 google::LogMessage::SendToLog() > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f7d58163 google::LogMessage::Flush() > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f7d5b169 > google::LogMessageFatal::~LogMessageFatal() > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f7cb8dbd process::ProcessManager::resume() > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 
0x7f65f7cbe926 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f3976070 (unknown) > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f3194e25 start_thread > Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal > mesos-agent[15492]: @ 0x7f65f2ebebad __clone > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail
[ https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902552#comment-16902552 ] James Peach commented on MESOS-9875: {noformat} f9330006-d885-4ef0-b2c7-c9c6fcc239e5 is the persistence ID. 5fa5c810-2dd3-41cb-9633-a3ef404b08c4 is the operation UUID. honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14 is the operation ID. I0627 22:03:17.360236 3529210 slave.cpp:4282] Updated checkpointed operations from [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: OPERATION_FINISHED) ] to [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: OPERATION_FINISHED), 5fa5c810-2dd3-41cb-9633-a3ef404b08c4 (CREATE for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14, latest state: OPERATION_PENDING) ] I0627 22:03:17.360723 3529210 slave.cpp:8670] Updating the state of operation 'honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14' (uuid: 5fa5c810-2dd3-41cb-9633-a3ef404b08c4) for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED) E0627 22:03:17.365811 3529210 slave.cpp:4257] EXIT with status 1: Failed to sync checkpointed resources: Failed to create the persistent volume f9330006-d885-4ef0-b2c7-c9c6fcc239e5 at '/srv/mesos/work/volumes/roles/test-3/f9330006-d885-4ef0-b2c7-c9c6fcc239e5': Operation not permitted {noformat} > Mesos did not respond correctly when operations should fail > --- > > Key: MESOS-9875 > URL: https://issues.apache.org/jira/browse/MESOS-9875 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Yifan Xing >Assignee: Greg Mann >Priority: Major > Labels: foundations, mesosphere > Attachments: Screen Shot 2019-06-27 
at 15.07.20.png > > > For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedback, we > sshed into the mesos-agent and made it unable to create subdirectories in > {{/srv/mesos/work/volumes}}; however, Mesos did not send any operation-failed > response. Instead, we received {{OPERATION_FINISHED}} feedback. > Steps to recreate the issue: > 1. Ssh into a magent. > 2. Make it impossible to create a persistent volume (we expect the agent to > crash and reregister, and the master to realize that the operation is > {{OPERATION_DROPPED}}): > * cd /srv/mesos/work (if it doesn't exist mkdir /srv/mesos/work/volumes) > * chattr -RV +i volumes (then no subdirectories can be created) > 3. Launch a service with persistent volumes with the constraint of only using > the magent modified above. > > > Logs for the scheduler for receiving `OPERATION_FINISHED`: > (Also see screenshot) > > 2019-06-27 21:57:11.879 [12768651|rdar://12768651] > [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored > operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and > feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on > serviceID=yifan-badagents-1 > > * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: > REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch > container: Failed to change the ownership of the persistent volume at > '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' > with uid 264 and gid 264: No such file or directory -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-7580) Use root fs as lower RO layer and container fs as upper layer
[ https://issues.apache.org/jira/browse/MESOS-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895849#comment-16895849 ] James Peach commented on MESOS-7580: I think that MESOS-9900 is related to this request. In MESOS-9900, any changes to the overlayfs upperdir will be charged to the container disk quota. > Use root fs as lower RO layer and container fs as upper layer > - > > Key: MESOS-7580 > URL: https://issues.apache.org/jira/browse/MESOS-7580 > Project: Mesos > Issue Type: Wish > Components: containerization >Reporter: Mikhail Lesyk >Priority: Major > > See example: > {code} > mkdir -p rootfs/{opt,container,workdir,result} > mount -t overlay -o > lowerdir=rootfs,upperdir=rootfs/container,workdir=rootfs/workdir none > rootfs/result > touch rootfs/result/opt/trash > umount rootfs/result > ls -a rootfs/opt/ > . .. > {code} > Where rootfs - imaginary root filesystem > rootfs/opt - variable directory on that filesystem > rootfs/container - container work dir > rootfs/result - result overlayfs mountpoint (root fs from the container's point of > view) > So, any change under rootfs/result will not be visible from the rootfs point of > view and it will remain clean, so every container could have its own snapshot of > the host's root filesystem, but changes would be individual. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9900) Include overlayfs upperdir in disk quota accounting.
[ https://issues.apache.org/jira/browse/MESOS-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-9900: -- Assignee: James Peach > Include overlayfs upperdir in disk quota accounting. > > > Key: MESOS-9900 > URL: https://issues.apache.org/jira/browse/MESOS-9900 > Project: Mesos > Issue Type: Improvement > Components: containerization, storage >Reporter: James Peach >Assignee: James Peach >Priority: Major > > Currently, the overlayfs upperdir is not included in any disk quota > accounting. This means that a task can write arbitrary amounts of data to > /tmp and will escape the sandbox disk quota. > Propose that we propagate the overlayfs upperdir directory to the disk > isolators so that they can manage this storage, and include it in the total > sandbox usage quota. This would need to be supported by both {{disk/du}} and > {{disk/xfs}} isolators. We should be able to propagate the additional > information out of the provisioner in {{ProvisionInfo}} and then into > {{ContainerConfig}}. > The proposed semantics would be that both the sandbox and overlayfs upperdir > usage would count towards the ephemeral disk quota. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
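The proposed semantics, where sandbox and overlayfs upperdir usage both count towards the ephemeral disk quota, can be sketched as follows. This is a Python illustration only: `directory_usage` and `ephemeral_usage` are hypothetical helpers, the sketch sums apparent file sizes rather than allocated blocks, and it does not reflect how the {{disk/du}} or {{disk/xfs}} isolators are implemented.

```python
import os
import tempfile


def directory_usage(root):
    """Sum the apparent sizes of regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def ephemeral_usage(sandbox, upperdir=None):
    """Usage charged to the ephemeral quota: sandbox plus optional upperdir."""
    usage = directory_usage(sandbox)
    if upperdir is not None:
        usage += directory_usage(upperdir)
    return usage


# Temporary directories stand in for a real sandbox and overlayfs upperdir.
with tempfile.TemporaryDirectory() as sandbox, \
        tempfile.TemporaryDirectory() as upperdir:
    with open(os.path.join(sandbox, "stdout"), "wb") as f:
        f.write(b"x" * 100)
    with open(os.path.join(upperdir, "scratch"), "wb") as f:
        f.write(b"y" * 400)  # e.g. data a task wrote under /tmp

    sandbox_only = ephemeral_usage(sandbox)
    combined = ephemeral_usage(sandbox, upperdir)
```

With a quota of, say, 300 bytes, the sandbox-only figure stays under the limit while the combined figure exceeds it, which is the escape the issue describes.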
[jira] [Created] (MESOS-9900) Include overlayfs upperdir in disk quota accounting.
James Peach created MESOS-9900: -- Summary: Include overlayfs upperdir in disk quota accounting. Key: MESOS-9900 URL: https://issues.apache.org/jira/browse/MESOS-9900 Project: Mesos Issue Type: Improvement Components: containerization, storage Reporter: James Peach Currently, the overlayfs upperdir is not included in any disk quota accounting. This means that a task can write arbitrary amounts of data to /tmp and will escape the sandbox disk quota. Propose that we propagate the overlayfs upperdir directory to the disk isolators so that they can manage this storage, and include it in the total sandbox usage quota. This would need to be supported by both {{disk/du}} and {{disk/xfs}} isolators. We should be able to propagate the additional information out of the provisioner in {{ProvisionInfo}} and then into {{ContainerConfig}}. The proposed semantics would be that both the sandbox and overlayfs upperdir usage would count towards the ephemeral disk quota. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9898) Add framework control over the no-new-privileges flag.
[ https://issues.apache.org/jira/browse/MESOS-9898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888461#comment-16888461 ] James Peach commented on MESOS-9898: /cc [~jjanco] > Add framework control over the no-new-privileges flag. > -- > > Key: MESOS-9898 > URL: https://issues.apache.org/jira/browse/MESOS-9898 > Project: Mesos > Issue Type: Improvement > Components: containerization, HTTP API >Reporter: James Peach >Priority: Major > > Following on from MESOS-9770, we can add framework control over whether the > no-new-privileges flag is set. > The implementation is to add a `no_new_privileges` boolean to the > {{SeccompInfo}} message that will allow a framework to toggle it on and off. > This means that the seccomp isolator must be ordered after the nnp isolator > so that it has priority (last writer wins in a protobuf merge). The nnp > isolator will still unconditionally set the flag. > Design doc: > https://docs.google.com/document/d/1x9S94-P0-nsXHGrwY4BHZ_NEC_bTFMIsDkxxaTd5Vok/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.14#76016)
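The ordering requirement above ("last writer wins in a protobuf merge") can be illustrated with plain dictionaries. This is a Python sketch under stated assumptions: dictionaries stand in for the protobuf launch information, `merge_launch_info` mimics merge behaviour for a singular scalar field only, and the isolator functions are hypothetical, not the Mesos isolator interface.

```python
def merge_launch_info(base, update):
    """Mimic a protobuf merge for singular fields: set fields in update win."""
    merged = dict(base)
    # None models "field not set", which leaves the earlier value untouched.
    merged.update({k: v for k, v in update.items() if v is not None})
    return merged


def run_isolators(isolators):
    """Merge each isolator's output in order; later isolators overwrite."""
    launch_info = {}
    for prepare in isolators:
        launch_info = merge_launch_info(launch_info, prepare())
    return launch_info


def nnp_isolator():
    # The nnp isolator unconditionally sets the flag.
    return {"no_new_privileges": True}


def seccomp_isolator_with(framework_choice):
    # The framework's choice is True, False, or None when it expressed none.
    def prepare():
        return {"no_new_privileges": framework_choice}
    return prepare


framework_disabled = run_isolators([nnp_isolator, seccomp_isolator_with(False)])
wrong_order = run_isolators([seccomp_isolator_with(False), nnp_isolator])
framework_silent = run_isolators([nnp_isolator, seccomp_isolator_with(None)])
```

With seccomp ordered last, the framework's choice takes effect; with the order reversed, the unconditional nnp setting overrides it.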
[jira] [Created] (MESOS-9898) Add framework control over the no-new-privileges flag.
James Peach created MESOS-9898: -- Summary: Add framework control over the no-new-privileges flag. Key: MESOS-9898 URL: https://issues.apache.org/jira/browse/MESOS-9898 Project: Mesos Issue Type: Improvement Components: containerization, HTTP API Reporter: James Peach Following on from MESOS-9770, we can add framework control over whether the no-new-privileges flag is set. The implementation is to add a `no_new_privileges` boolean to the {{SeccompInfo}} message that will allow a framework to toggle it on and off. This means that the seccomp isolator must be ordered after the nnp isolator so that it has priority (last writer wins in a protobuf merge). The nnp isolator will still unconditionally set the flag. Design doc: https://docs.google.com/document/d/1x9S94-P0-nsXHGrwY4BHZ_NEC_bTFMIsDkxxaTd5Vok/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9770) Add no-new-privileges isolator.
[ https://issues.apache.org/jira/browse/MESOS-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888455#comment-16888455 ] James Peach commented on MESOS-9770: | https://reviews.apache.org/r/71106/ | | https://reviews.apache.org/r/70757/| | https://reviews.apache.org/r/71107/ | > Add no-new-privileges isolator. > --- > > Key: MESOS-9770 > URL: https://issues.apache.org/jira/browse/MESOS-9770 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: Jacob Janco >Priority: Major > > To give security-minded operators more defense in depth, add a {{linux/nnp}} > isolator that sets the no-new-privileges bit before starting the executor. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail
[ https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876736#comment-16876736 ] James Peach commented on MESOS-9875: {{f9330006-d885-4ef0-b2c7-c9c6fcc239e5}} is the persistence ID. {{5fa5c810-2dd3-41cb-9633-a3ef404b08c4}} is the operation UUID. {{honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14}} is the operation ID. {noformat} I0627 22:03:17.360236 3529210 slave.cpp:4282] Updated checkpointed operations from [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: OPERATION_FINISHED) ] to [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: OPERATION_FINISHED), 5fa5c810-2dd3-41cb-9633-a3ef404b08c4 (CREATE for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14, latest state: OPERATION_PENDING) ] ... I0627 22:03:17.360723 3529210 slave.cpp:8670] Updating the state of operation 'honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14' (uuid: 5fa5c810-2dd3-41cb-9633-a3ef404b08c4) for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED) ... 
E0627 22:03:17.365811 3529210 slave.cpp:4257] EXIT with status 1: Failed to sync checkpointed resources: Failed to create the persistent volume f9330006-d885-4ef0-b2c7-c9c6fcc239e5 at '/srv/mesos/work/volumes/roles/test-3/f9330006-d885-4ef0-b2c7-c9c6fcc239e5': Operation not permitted {noformat} The relevant code sequence is in Slave::applyOperation, and looks roughly like this: {noformat} track the new operation checkpointResourceState() (1) apply the operation (2) report that the operation was applied checkpointResourceState() (3) {noformat} The operation is checkpointed as pending in (1), but no resource changes are made yet. In (2), the operation is applied by making changes to the agent resources. At (3) the checkpointed resources discrepancy is discovered and the agent tries to create the persistent volume and fails. > Mesos did not respond correctly when operations should fail > --- > > Key: MESOS-9875 > URL: https://issues.apache.org/jira/browse/MESOS-9875 > Project: Mesos > Issue Type: Bug >Reporter: Yifan Xing >Priority: Major > > For testing persistent volumes with `OPERATION_FAILED/ERROR` feedback, we > sshed into the mesos-agent and made it unable to create subdirectories in > /srv/mesos/work/volumes; however, Mesos did not send any operation-failed > response. Instead, we received `OPERATION_FINISHED` feedback. > Steps to recreate the issue: > 1. Ssh into a magent. > 2. Make it impossible to create a persistent volume (we expect the agent to > crash and reregister, and the master to realize that the operation is > `OPERATION_DROPPED`): > * cd /srv/mesos/work (if it doesn't exist mkdir /srv/mesos/work/volumes) > * chattr -RV +i volumes (then no subdirectories can be created) > 3. Launch a service with persistent volumes with the constraint of only using > the magent modified above. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
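The sequence described in the comment can be modelled with a toy simulation that shows why the scheduler sees {{OPERATION_FINISHED}} before the agent exits. This is a deliberately simplified Python model, not the real Slave::applyOperation: the `Agent` class, `AgentExit` exception, and event strings are all hypothetical, and the only behaviour assumed is the one stated above, namely that the volume is actually created while syncing checkpointed resources.

```python
class AgentExit(Exception):
    """Stands in for the agent exiting with status 1."""


class Agent:
    def __init__(self, can_create_volume):
        self.can_create_volume = can_create_volume
        self.operations = {}    # operation ID -> latest state
        self.resources = set()  # persistent volumes the agent should hold
        self.events = []        # ordered record of what happened

    def checkpoint_resource_state(self, step):
        # Syncing checkpointed resources is where volumes get created on disk.
        for volume in sorted(self.resources):
            if not self.can_create_volume:
                self.events.append(f"checkpoint {step} failed: cannot create {volume}")
                raise AgentExit(volume)
        self.events.append(f"checkpoint {step} ok")

    def apply_operation(self, op_id, volume):
        self.operations[op_id] = "OPERATION_PENDING"  # track the new operation
        self.checkpoint_resource_state("(1)")         # no resource changes yet
        self.resources.add(volume)                    # (2) apply the operation
        self.operations[op_id] = "OPERATION_FINISHED"
        self.events.append(f"{op_id} reported OPERATION_FINISHED")
        self.checkpoint_resource_state("(3)")         # failure surfaces here


agent = Agent(can_create_volume=False)
try:
    agent.apply_operation("op-1", "vol-1")
    agent_exited = False
except AgentExit:
    agent_exited = True
```

In this model the success report is recorded before step (3) runs, so a failure to create the volume can only show up as an agent exit, never as operation feedback.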
[jira] [Commented] (MESOS-9800) libarchive cannot extract tarfile due to UTF-8 encoding issues
[ https://issues.apache.org/jira/browse/MESOS-9800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868084#comment-16868084 ] James Peach commented on MESOS-9800: Sorry it took so long to get back to you [~falfaro]. We are carrying a revert of 2198b961d24b788564d36490cf52f78d7ec07655 > libarchive cannot extract tarfile due to UTF-8 encoding issues > -- > > Key: MESOS-9800 > URL: https://issues.apache.org/jira/browse/MESOS-9800 > Project: Mesos > Issue Type: Bug > Components: fetcher >Affects Versions: 1.7.2 > Environment: Mesos 1.7.2 and Marathon 1.4.3 running on top of Ubuntu > 16.04. >Reporter: Felipe Alfaro Solana >Priority: Major > Attachments: certificates2.tar.gz > > > Starting with Mesos 1.7, the following change has been introduced: > * [MESOS-8064] - Mesos now requires libarchive to programmatically decode > .zip, .tar, .gzip, and other common file compression schemes. Version 3.3.2 > is bundled in Mesos. > However, this version of libarchive which is used by the fetcher component in > Mesos has problems in dealing with archive files (.tar and .zip) which > contain UTF-8 characters. We run Marathon on top of Mesos, and one of our > Marathon applications relies on a .tar file which contains symlinks whose > target contains certain UTF-8 characters (Turkish) or the symlink name itself > contains UTF-8 characters. 
Mesos fetcher is unable to extract the archive and > fails with the following error: > {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 > 10:47:30.791250 6136 fetcher.cpp:613] EXIT with status 1: Failed to fetch > '/tmp/certificates.tar.gz': Failed to extract archive > '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0/certificates.tar.gz' > to > '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0': > Failed to read archive header: Linkname can't be converted from UTF-8 to > current locale.}} > {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]:}} > {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: End > fetcher log for container 6a6e87e8-5eef-4e8e-8c00-3f081fa187b0}} > {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 > 10:47:30.846695 4343 fetcher.cpp:571] Failed to run mesos-fetcher: Failed to > fetch all URIs for container '6a6e87e8-5eef-4e8e-8c00-3f081fa187b0': exited > with status 1}} > The same Marathon application works fine with Mesos 1.6 which does not use > libarchive. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9804) Subprocess should close inherited file descriptors earlier.
[ https://issues.apache.org/jira/browse/MESOS-9804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865376#comment-16865376 ] James Peach commented on MESOS-9804: This is not to be fixed. The current code doesn't close after the fork, but does mark the inherited descriptors {{CLOEXEC}}. If we close these instead, then it would be harder for subprocess hooks to pass a fd into the child and use it in a child hook, which is a legitimate and useful pattern. If we don't close it, then we have the same semantics as today. So I think that the current code works correctly. > Subprocess should close inherited file descriptors earlier. > --- > > Key: MESOS-9804 > URL: https://issues.apache.org/jira/browse/MESOS-9804 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: James Peach >Priority: Major > > The libprocess {{subprocess}} API doesn't close the file descriptors that are > inherited across fork until after applying the child hooks. This means that > the inherited descriptors can remain open for much longer than you expect, > since parent and child hooks both need to be scheduled and run. > We should move the file descriptor closing as early as possible in the child. > We might also consider having the child write a byte back to the parent so > that we have a guaranteed synchronization point. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
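The close-on-exec behaviour the comment relies on can be shown with the POSIX `fcntl` interface. This is a Python sketch for POSIX systems, not libprocess code; `set_cloexec` and `is_cloexec` are hypothetical helper names. A descriptor marked this way stays open across fork, so a child hook can still use it, and is closed automatically at exec.

```python
import fcntl
import os


def set_cloexec(fd):
    """Mark fd close-on-exec: it survives fork() but is closed at exec()."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def is_cloexec(fd):
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


read_fd, write_fd = os.pipe()
# Python creates pipe descriptors non-inheritable by default, so clear the
# flag first to model a descriptor that would leak into an exec'd program.
os.set_inheritable(read_fd, True)
before = is_cloexec(read_fd)

set_cloexec(read_fd)
after = is_cloexec(read_fd)

os.close(read_fd)
os.close(write_fd)
```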
[jira] [Commented] (MESOS-9848) Blkio cgroup statistics files missing in Linux 5.1
[ https://issues.apache.org/jira/browse/MESOS-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865230#comment-16865230 ] James Peach commented on MESOS-9848: /cc [~jieyu] [~gilbert] [~qianzhang] > Blkio cgroup statistics files missing in Linux 5.1 > -- > > Key: MESOS-9848 > URL: https://issues.apache.org/jira/browse/MESOS-9848 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Priority: Major > > In recent Fedora release, the Linux blkio cgroup no longer publishes certain > stats files that the Mesos isolator expects should exist. > In {{BlkioSubsystemProcess::usage}}, the isolator looks for > * {{blkio.time}} > * {{blkio.sectors}} > * {{blkio.io_merged}} > * {{blkio.io_queued}} > Here's the actual cgroup: > {noformat} > $ uname -r > 5.1.8-300.fc30.x86_64 > ... > [root@jpeach 184cf411-e73f-4c6e-bd54-8181222801af]# pwd > /sys/fs/cgroup/blkio/mesos_test_c83596ce-76ff-47c8-b23d-1276c16e93ae/184cf411-e73f-4c6e-bd54-8181222801af > [root@jpeach 184cf411-e73f-4c6e-bd54-8181222801af]# ls -l > total 0 > -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_service_bytes > -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_service_bytes_recursive > -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_serviced > -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_serviced_recursive > -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.weight > --w--- 1 root root 0 Jun 16 18:07 blkio.reset_stats > -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_service_bytes > -r--r--r-- 1 root root 0 Jun 16 18:07 > blkio.throttle.io_service_bytes_recursive > -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_serviced > -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_serviced_recursive > -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.read_bps_device > -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.read_iops_device > -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.write_bps_device > -rw-r--r-- 
1 root root 0 Jun 16 18:07 blkio.throttle.write_iops_device > -rw-r--r-- 1 root root 0 Jun 16 18:07 cgroup.clone_children > -rw-r--r-- 1 root root 0 Jun 16 18:06 cgroup.procs > -rw-r--r-- 1 root root 0 Jun 16 18:07 notify_on_release > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
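One way to cope with statistics files that are absent on some kernels is to read what exists and report the rest as missing. The Python sketch below assumes that approach; `read_available_stats` is a hypothetical helper, not the logic in {{BlkioSubsystemProcess::usage}}, and a temporary directory with made-up file contents stands in for a real cgroup.

```python
import os
import tempfile

# The four files the issue says the isolator looks for.
EXPECTED_STATS = ("blkio.time", "blkio.sectors",
                  "blkio.io_merged", "blkio.io_queued")


def read_available_stats(cgroup_dir, names=EXPECTED_STATS):
    """Return (found, missing): file contents by name, and names not present."""
    found, missing = {}, []
    for name in names:
        try:
            with open(os.path.join(cgroup_dir, name)) as f:
                found[name] = f.read()
        except FileNotFoundError:
            missing.append(name)
    return found, missing


# Simulated cgroup directory that only provides one of the expected files.
with tempfile.TemporaryDirectory() as cgroup_dir:
    with open(os.path.join(cgroup_dir, "blkio.time"), "w") as f:
        f.write("8:0 42\n")  # placeholder content for illustration
    with open(os.path.join(cgroup_dir, "blkio.throttle.io_serviced"), "w") as f:
        f.write("Total 0\n")

    found, missing = read_available_stats(cgroup_dir)
```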
[jira] [Created] (MESOS-9848) Blkio cgroup statistics files missing in Linux 5.1
James Peach created MESOS-9848: -- Summary: Blkio cgroup statistics files missing in Linux 5.1 Key: MESOS-9848 URL: https://issues.apache.org/jira/browse/MESOS-9848 Project: Mesos Issue Type: Improvement Components: containerization Reporter: James Peach In recent Fedora release, the Linux blkio cgroup no longer publishes certain stats files that the Mesos isolator expects should exist. In {{BlkioSubsystemProcess::usage}}, the isolator looks for * {{blkio.time}} * {{blkio.sectors}} * {{blkio.io_merged}} * {{blkio.io_queued}} Here's the actual cgroup: {noformat} $ uname -r 5.1.8-300.fc30.x86_64 ... [root@jpeach 184cf411-e73f-4c6e-bd54-8181222801af]# pwd /sys/fs/cgroup/blkio/mesos_test_c83596ce-76ff-47c8-b23d-1276c16e93ae/184cf411-e73f-4c6e-bd54-8181222801af [root@jpeach 184cf411-e73f-4c6e-bd54-8181222801af]# ls -l total 0 -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_service_bytes -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_service_bytes_recursive -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_serviced -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_serviced_recursive -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.weight --w--- 1 root root 0 Jun 16 18:07 blkio.reset_stats -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_service_bytes -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_service_bytes_recursive -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_serviced -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_serviced_recursive -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.read_bps_device -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.read_iops_device -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.write_bps_device -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.write_iops_device -rw-r--r-- 1 root root 0 Jun 16 18:07 cgroup.clone_children -rw-r--r-- 1 root root 0 Jun 16 18:06 cgroup.procs -rw-r--r-- 1 root root 0 Jun 16 18:07 notify_on_release {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9805) Run cgroup subsystems before moving the target PID.
[ https://issues.apache.org/jira/browse/MESOS-9805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-9805: -- Resolution: Fixed Assignee: James Peach Target Version/s: 1.9.0 > Run cgroup subsystems before moving the target PID. > --- > > Key: MESOS-9805 > URL: https://issues.apache.org/jira/browse/MESOS-9805 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > > Currently, the PID targeted by the cgroups isolator is moved into the cgroup > before the subsystem runs to apply any type-specific cgroup configuration. We > should reverse the order of this so that the PID is only moved once the > cgroup is fully configured by the subsystem. > The specific use case that affected us was where a PID was assigned to a > {{net_cls}} cgroup before that cgroup had the class ID set. This caused a > separate system to become confused. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
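The requested ordering, configure the cgroup first and only then move the PID, is sketched below in Python. The `assign_configured` helper is hypothetical and a temporary directory stands in for a real {{net_cls}} cgroup; `net_cls.classid` and `cgroup.procs` are the cgroup v1 control files, while the class ID and PID values are arbitrary illustrations.

```python
import os
import tempfile


def assign_configured(cgroup_dir, pid, classid, log):
    """Apply subsystem configuration first, then move the PID into the cgroup."""
    with open(os.path.join(cgroup_dir, "net_cls.classid"), "w") as f:
        f.write(str(classid))
    log.append("classid set")

    # Only now does the process become a member, so an observer never sees
    # the PID inside a cgroup whose class ID is still unset.
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "a") as f:
        f.write(f"{pid}\n")
    log.append("pid moved")


log = []
with tempfile.TemporaryDirectory() as cgroup_dir:
    assign_configured(cgroup_dir, pid=1234, classid=0x100001, log=log)
    with open(os.path.join(cgroup_dir, "net_cls.classid")) as f:
        classid_written = f.read()
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        procs_written = f.read()
```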
[jira] [Commented] (MESOS-9800) libarchive cannot extract tarfile due to UTF-8 encoding issues
[ https://issues.apache.org/jira/browse/MESOS-9800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855625#comment-16855625 ] James Peach commented on MESOS-9800: We hit the same problem internally a while ago, and carried a patch to revert to using {{/usr/bin/tar}}. If you are building your own Mesos, try passing the {{\-\-with-libarchive}} flag to use the system library, which is likely to have been built with {{iconv}} support. > libarchive cannot extract tarfile due to UTF-8 encoding issues > -- > > Key: MESOS-9800 > URL: https://issues.apache.org/jira/browse/MESOS-9800 > Project: Mesos > Issue Type: Bug > Components: fetcher >Affects Versions: 1.7.2 > Environment: Mesos 1.7.2 and Marathon 1.4.3 running on top of Ubuntu > 16.04. >Reporter: Felipe Alfaro Solana >Priority: Major > Attachments: certificates2.tar.gz > > > Starting with Mesos 1.7, the following change has been introduced: > * [MESOS-8064] - Mesos now requires libarchive to programmatically decode > .zip, .tar, .gzip, and other common file compression schemes. Version 3.3.2 > is bundled in Mesos. > However, this version of libarchive which is used by the fetcher component in > Mesos has problems in dealing with archive files (.tar and .zip) which > contain UTF-8 characters. We run Marathon on top of Mesos, and one of our > Marathon applications relies on a .tar file which contains symlinks whose > target contains certain UTF-8 characters (Turkish) or the symlink name itself > contains UTF-8 characters. 
Mesos fetcher is unable to extract the archive and > fails with the following error: > {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 > 10:47:30.791250 6136 fetcher.cpp:613] EXIT with status 1: Failed to fetch > '/tmp/certificates.tar.gz': Failed to extract archive > '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0/certificates.tar.gz' > to > '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0': > Failed to read archive header: Linkname can't be converted from UTF-8 to > current locale.}} > {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]:}} > {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: End > fetcher log for container 6a6e87e8-5eef-4e8e-8c00-3f081fa187b0}} > {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 > 10:47:30.846695 4343 fetcher.cpp:571] Failed to run mesos-fetcher: Failed to > fetch all URIs for container '6a6e87e8-5eef-4e8e-8c00-3f081fa187b0': exited > with status 1}} > The same Marathon application works fine with Mesos 1.6 which does not use > libarchive. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9805) Run cgroup subsystems before moving the target PID.
[ https://issues.apache.org/jira/browse/MESOS-9805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852665#comment-16852665 ] James Peach commented on MESOS-9805: /cc [~gilbert], [~jieyu] [~qianzhang] > Run cgroup subsystems before moving the target PID. > --- > > Key: MESOS-9805 > URL: https://issues.apache.org/jira/browse/MESOS-9805 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Priority: Major > > Currently, the PID targeted by the cgroups isolator is moved into the cgroup > before the subsystem runs to apply any type-specific cgroup configuration. We > should reverse this order so that the PID is only moved once the > cgroup is fully configured by the subsystem. > The specific use case that affected us was where a PID was assigned to a > {{net_cls}} cgroup before that cgroup had the class ID set. This caused a > separate system to become confused. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9805) Run cgroup subsystems before moving the target PID.
James Peach created MESOS-9805: -- Summary: Run cgroup subsystems before moving the target PID. Key: MESOS-9805 URL: https://issues.apache.org/jira/browse/MESOS-9805 Project: Mesos Issue Type: Improvement Components: containerization Reporter: James Peach Currently, the PID targeted by the cgroups isolator is moved into the cgroup before the subsystem runs to apply any type-specific cgroup configuration. We should reverse this order so that the PID is only moved once the cgroup is fully configured by the subsystem. The specific use case that affected us was where a PID was assigned to a {{net_cls}} cgroup before that cgroup had the class ID set. This caused a separate system to become confused. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
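The ordering proposed in MESOS-9805 can be sketched as follows. This is an illustration in Python rather than the isolator's actual code: `configure_then_attach` is a hypothetical helper, and it assumes a cgroup v1 directory that exposes the standard `net_cls.classid` and `cgroup.procs` files. The point is only that the class ID is written before the PID is moved.

```python
import os


def configure_then_attach(cgroup_dir, classid, pid):
    """Configure the net_cls cgroup first, then move the PID into it.

    Returns the names of the files written, in order, so callers (and
    tests) can confirm that configuration preceded the PID move.
    """
    steps = []

    # Step 1: subsystem-specific configuration (here, the net_cls class ID).
    with open(os.path.join(cgroup_dir, "net_cls.classid"), "w") as f:
        f.write(f"{classid:#010x}\n")
    steps.append("net_cls.classid")

    # Step 2: only now is the PID placed in the fully configured cgroup.
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(f"{pid}\n")
    steps.append("cgroup.procs")

    return steps
```

With this order, an external observer never sees a process in the cgroup while the class ID is still unset.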
[jira] [Created] (MESOS-9804) Subprocess should close inherited file descriptors earlier
James Peach created MESOS-9804: -- Summary: Subprocess should close inherited file descriptors earlier Key: MESOS-9804 URL: https://issues.apache.org/jira/browse/MESOS-9804 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: James Peach The libprocess {{subprocess}} API doesn't close the file descriptors that are inherited across fork until after applying the child hooks. This means that the inherited descriptors can remain open for much longer than you expect, since parent and child hooks both need to be scheduled and run. We should move the file descriptor closing as early as possible in the child. We might also consider having the child write a byte back to the parent so that we have a guaranteed synchronization point. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
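The two ideas in MESOS-9804 (close inherited descriptors as the first thing the child does, then write a byte back to the parent) can be sketched like this. It is a Python illustration of the pattern, not the libprocess {{subprocess}} implementation. The names `close_inherited_fds`, `fork_and_sync` and `child_hook` are made up for the example, and it assumes a POSIX system where `/dev/fd` lists the open descriptors.

```python
import os


def close_inherited_fds(keep):
    """Close every open descriptor in this process except those in `keep`."""
    for name in os.listdir("/dev/fd"):
        fd = int(name)
        if fd in keep:
            continue
        try:
            os.close(fd)
        except OSError:
            # For example, the descriptor listdir() used is already closed.
            pass


def fork_and_sync(child_hook=None):
    """Fork; the child closes inherited fds first, then signals the parent.

    Returns (synced, wait_status). `synced` is True once the parent has read
    the child's byte, i.e. after the inherited descriptors were closed.
    """
    sync_r, sync_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            # Close inherited descriptors before any hooks run.
            close_inherited_fds(keep={0, 1, 2, sync_w})
            # Synchronization point: tell the parent the closing is done.
            os.write(sync_w, b"\0")
            if child_hook is not None:
                child_hook()
            status = 0
        finally:
            os._exit(status)

    os.close(sync_w)
    synced = os.read(sync_r, 1) == b"\0"
    os.close(sync_r)
    _, wait_status = os.waitpid(pid, 0)
    return synced, wait_status
```

Because the byte is written only after the descriptors are closed, the parent has a definite point after which it knows the child no longer holds them, however long the hooks take.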
[jira] [Commented] (MESOS-9799) Adopt container file operations in secrets volumes.
[ https://issues.apache.org/jira/browse/MESOS-9799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850489#comment-16850489 ] James Peach commented on MESOS-9799: | [r/70741|https://reviews.apache.org/r/70741] | Adopted container file operations for secrets volumes. | > Adopt container file operations in secrets volumes. > --- > > Key: MESOS-9799 > URL: https://issues.apache.org/jira/browse/MESOS-9799 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > > Adopt containerized file operations in the secrets volume isolator so that it > doesn't have to use pre-exec commands. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9799) Adopt container file operations in secrets volumes.
[ https://issues.apache.org/jira/browse/MESOS-9799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-9799: -- Assignee: James Peach > Adopt container file operations in secrets volumes. > --- > > Key: MESOS-9799 > URL: https://issues.apache.org/jira/browse/MESOS-9799 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > > Adopt containerized file operations in the secrets volume isolator so that it > doesn't have to use pre-exec commands. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9799) Adopt container file operations in secrets volumes.
James Peach created MESOS-9799: -- Summary: Adopt container file operations in secrets volumes. Key: MESOS-9799 URL: https://issues.apache.org/jira/browse/MESOS-9799 Project: Mesos Issue Type: Improvement Components: containerization Reporter: James Peach Adopt containerized file operations in the secrets volume isolator so that it doesn't have to use pre-exec commands. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9769) Add direct containerized support for filesystem operations.
[ https://issues.apache.org/jira/browse/MESOS-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-9769: -- Assignee: James Peach > Add direct containerized support for filesystem operations. > --- > > Key: MESOS-9769 > URL: https://issues.apache.org/jira/browse/MESOS-9769 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > > When setting up the container filesystems, we use `pre_exec_commands` to make > ABI symlinks and other things. The problem with this is that, depending on > the order of operations, we may not have the full security policy in place > yet, but since we are running in the context of the container's mount > namespaces, the programs we execute are under the control of whoever built > the container image. > [~jieyu] and I previously discussed adding filesystem operations to the > `ContainerLaunchInfo`. Just `ln` would be sufficient for the `cgroups` and > `linux/filesystem` isolators. Secrets and port mapping isolators need more, > so we should discuss and file new tickets if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag
[ https://issues.apache.org/jira/browse/MESOS-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844438#comment-16844438 ] James Peach commented on MESOS-9768: {quote} What we are primarily interested in is to set it for the overlay backend but there are multiple backend options. Seems like a common flag --image_mount_options could be applicable to bind backend as well (maybe aufs too? Gilbert Song). It doesn't apply to the copy backend of course. {quote} I think that the main mount option that applies to non-overlayfs backends is {{MS_RDONLY}}. Since you only get one image provisioner backend, I think that a single global option is OK. Each backend can error out if there are any mount options provided that it can't support. Making this a per-container option is more complex. We can table the issue of mount flags for non-image volumes here, since I expect that the configuration for that will be different. > Allow operators to mount the container rootfs with the `nosuid` flag > > > Key: MESOS-9768 > URL: https://issues.apache.org/jira/browse/MESOS-9768 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Priority: Major > > If cluster users are allowed to launch containers with arbitrary images, > those images may contain setuid programs. For security reasons (auditing, > privilege escalation), operators may wish to ensure that setuid programs > cannot be used within a container. > > We should provide a way for operators to be able to specify that container > volumes (including `/`) should be mounted with the `nosuid` flag. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9771) Mask sensitive procfs paths.
[ https://issues.apache.org/jira/browse/MESOS-9771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-9771: -- Assignee: James Peach | [r/70678|https://reviews.apache.org/r/70678] | Add containerizer support for masking paths. | > Mask sensitive procfs paths. > > > Key: MESOS-9771 > URL: https://issues.apache.org/jira/browse/MESOS-9771 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > > We already have a set of procfs paths that we mark read-only in the > containerizer, but there are additional paths that are considered sensitive > by other containerizers and are masked altogether: > {noformat} > "/proc/asound" > "/proc/acpi" > "/proc/kcore" > "/proc/keys" > "/proc/latency_stats" > "/proc/timer_list" > "/proc/timer_stats" > "/proc/sched_debug" > "/sys/firmware" > "/proc/scsi" > {noformat} > Masking is done by mounting {{/dev/null}} on files, and an empty, readonly > {{tmpfs}} on directories. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9771) Mask sensitive procfs paths.
[ https://issues.apache.org/jira/browse/MESOS-9771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834402#comment-16834402 ] James Peach commented on MESOS-9771: Since {{/proc/keys}} gets masked, we should probably mask {{/proc/key-users}} too. Weird that I don't see other containerizers doing that. My main concern with this change is compatibility with containerized services like CSI, that may need privileged access to the host. Masking all these paths for this kind of service could break them. There are a few possible solutions: 1. Skip the masking based on properties of the launch, e.g. whether the Docker {{privileged}} flag is set, or whether the container is joining the host's PID namespace. 2. Add a flag that specifies the set of paths to mask, so that operators can whack it with configuration. 3. Unconditionally do the masking. If we go down the path of (2), then operators who need privileged containers to see this information will be stranded, so my preference would be something closer to (1). If we prefer (3), then we already unconditionally make certain container paths read-only, which could be regarded as precedent. /cc [~jieyu] [~gilbert] [~jasonlai] > Mask sensitive procfs paths. > > > Key: MESOS-9771 > URL: https://issues.apache.org/jira/browse/MESOS-9771 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Priority: Major > > We already have a set of procfs paths that we mark read-only in the > containerizer, but there are additional paths that are considered sensitive > by other containerizers and are masked altogether: > {noformat} > "/proc/asound" > "/proc/acpi" > "/proc/kcore" > "/proc/keys" > "/proc/latency_stats" > "/proc/timer_list" > "/proc/timer_stats" > "/proc/sched_debug" > "/sys/firmware" > "/proc/scsi" > {noformat} > Masking is done by mounting {{/dev/null}} on files, and an empty, readonly > {{tmpfs}} on directories. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9771) Mask sensitive procfs paths.
James Peach created MESOS-9771: -- Summary: Mask sensitive procfs paths. Key: MESOS-9771 URL: https://issues.apache.org/jira/browse/MESOS-9771 Project: Mesos Issue Type: Improvement Components: containerization Reporter: James Peach We already have a set of procfs paths that we mark read-only in the containerizer, but there are additional paths that are considered sensitive by other containerizers and are masked altogether: {noformat} "/proc/asound" "/proc/acpi" "/proc/kcore" "/proc/keys" "/proc/latency_stats" "/proc/timer_list" "/proc/timer_stats" "/proc/sched_debug" "/sys/firmware" "/proc/scsi" {noformat} Masking is done by mounting {{/dev/null}} on files, and an empty, readonly {{tmpfs}} on directories. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
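The masking rule described in MESOS-9771 ({{/dev/null}} over files, a read-only {{tmpfs}} over directories) can be sketched as a planning step. The Python below is an illustration, not the containerizer code: `plan_masks` is a hypothetical helper that only builds mount(8) argument lists and never mounts anything. Skipping paths that do not exist is an assumption of this sketch, not something the issue specifies.

```python
import os


def plan_masks(paths):
    """Return mount(8) argument lists that would mask each existing path.

    Directories get an empty read-only tmpfs; other existing paths get
    /dev/null bind-mounted over them. Missing paths are skipped.
    """
    plan = []
    for path in paths:
        if os.path.isdir(path):
            plan.append(["mount", "-t", "tmpfs", "-o", "ro", "tmpfs", path])
        elif os.path.exists(path):
            plan.append(["mount", "--bind", "/dev/null", path])
    return plan
```

Keeping the plan separate from its execution makes it easy to inspect what would be masked for a given container before any mounts are made.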
[jira] [Commented] (MESOS-9770) Add no-new-privileges isolator
[ https://issues.apache.org/jira/browse/MESOS-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834398#comment-16834398 ] James Peach commented on MESOS-9770: /cc [~jieyu] [~gilbert] [~abudnik] > Add no-new-privileges isolator > -- > > Key: MESOS-9770 > URL: https://issues.apache.org/jira/browse/MESOS-9770 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Priority: Major > > To give security-minded operators more defense in depth, add a {{linux/nnp}} > isolator that sets the no-new-privileges bit before starting the executor. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9770) Add no-new-privileges isolator
James Peach created MESOS-9770: -- Summary: Add no-new-privileges isolator Key: MESOS-9770 URL: https://issues.apache.org/jira/browse/MESOS-9770 Project: Mesos Issue Type: Improvement Components: containerization Reporter: James Peach To give security-minded operators more defense in depth, add a {{linux/nnp}} isolator that sets the no-new-privileges bit before starting the executor. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9769) Add direct containerized support for filesystem operations
James Peach created MESOS-9769: -- Summary: Add direct containerized support for filesystem operations Key: MESOS-9769 URL: https://issues.apache.org/jira/browse/MESOS-9769 Project: Mesos Issue Type: Improvement Components: containerization Reporter: James Peach When setting up the container filesystems, we use `pre_exec_commands` to make ABI symlinks and other things. The problem with this is that, depending on the order of operations, we may not have the full security policy in place yet, but since we are running in the context of the container's mount namespaces, the programs we execute are under the control of whoever built the container image. [~jieyu] and I previously discussed adding filesystem operations to the `ContainerLaunchInfo`. Just `ln` would be sufficient for the `cgroups` and `linux/filesystem` isolators. Secrets and port mapping isolators need more, so we should discuss and file new tickets if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag
[ https://issues.apache.org/jira/browse/MESOS-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834357#comment-16834357 ] James Peach edited comment on MESOS-9768 at 5/7/19 3:56 AM: /cc [~jieyu] [~gilbert] was (Author: jamespeach): /cc [~jieyu] @gilbert > Allow operators to mount the container rootfs with the `nosuid` flag > > > Key: MESOS-9768 > URL: https://issues.apache.org/jira/browse/MESOS-9768 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Priority: Major > > If cluster users are allowed to launch containers with arbitrary images, > those images may contain setuid programs. For security reasons (auditing, > privilege escalation), operators may wish to ensure that setuid programs > cannot be used within a container. > > We should provide a way for operators to be able to specify that container > volumes (including `/`) should be mounted with the `nosuid` flag. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag
[ https://issues.apache.org/jira/browse/MESOS-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834357#comment-16834357 ] James Peach commented on MESOS-9768: /cc [~jieyu] @gilbert > Allow operators to mount the container rootfs with the `nosuid` flag > > > Key: MESOS-9768 > URL: https://issues.apache.org/jira/browse/MESOS-9768 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Priority: Major > > If cluster users are allowed to launch containers with arbitrary images, > those images may contain setuid programs. For security reasons (auditing, > privilege escalation), operators may wish to ensure that setuid programs > cannot be used within a container. > > We should provide a way for operators to be able to specify that container > volumes (including `/`) should be mounted with the `nosuid` flag. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag
James Peach created MESOS-9768: -- Summary: Allow operators to mount the container rootfs with the `nosuid` flag Key: MESOS-9768 URL: https://issues.apache.org/jira/browse/MESOS-9768 Project: Mesos Issue Type: Improvement Components: containerization Reporter: James Peach If cluster users are allowed to launch containers with arbitrary images, those images may contain setuid programs. For security reasons (auditing, privilege escalation), operators may wish to ensure that setuid programs cannot be used within a container. We should provide a way for operators to be able to specify that container volumes (including `/`) should be mounted with the `nosuid` flag. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
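An operator who wants to confirm that a container volume really is mounted `nosuid`, as MESOS-9768 proposes, could check the mount flags from inside the container. The sketch below uses Python's `os.statvfs`; the helper name `is_mounted_nosuid` is made up for this example, and the check is a verification aid, not part of the proposed Mesos change.

```python
import os


def is_mounted_nosuid(path):
    """Report whether the filesystem containing `path` is mounted nosuid."""
    flags = os.statvfs(path).f_flag
    return bool(flags & os.ST_NOSUID)
```

Running `is_mounted_nosuid("/")` inside a container would show whether the rootfs carries the flag.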
[jira] [Assigned] (MESOS-9349) Prevent ptracing of container management processes.
[ https://issues.apache.org/jira/browse/MESOS-9349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-9349: -- Assignee: James Peach Priority: Minor (was: Major) Fix Version/s: 1.8.0 Issue Type: Improvement (was: Bug) | [r/69615|https://reviews.apache.org/r/69615] | Disable containerizer ptrace attach. | > Prevent ptracing of container management processes. > --- > > Key: MESOS-9349 > URL: https://issues.apache.org/jira/browse/MESOS-9349 > Project: Mesos > Issue Type: Improvement > Components: containerization, security >Reporter: James Peach >Assignee: James Peach >Priority: Minor > Fix For: 1.8.0 > > > The container launcher and the built-in executors are (at least partially) > accessible to containerized user tasks. Since these processes may contain > secrets or hold privileged resources, we can increase the difficulty of > attacking them by preventing user tasks attaching to them with ptrace(2). > This amounts to calling `prctl(PR_SET_DUMPABLE, 0)`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9319) Move root filesystem creation to the `filesystem/linux` isolator.
[ https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699772#comment-16699772 ] James Peach commented on MESOS-9319: Updated patch series: | [r/69211|https://reviews.apache.org/r/69211] | Improved the code comments for `getContainerDevicesPath`. | | [r/69210|https://reviews.apache.org/r/69210] | Used the MS_SILENT mount flag to elide unwanted logging. | | [r/69086|https://reviews.apache.org/r/69086] | Moved the container root construction to the isolators. | | [r/69450|https://reviews.apache.org/r/69450] | Applied the `ContainerMountInfo` protobuf helper. | > Move root filesystem creation to the `filesystem/linux` isolator. > - > > Key: MESOS-9319 > URL: https://issues.apache.org/jira/browse/MESOS-9319 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > > When using a custom user namespace isolator, the task fails at launch because > opening devices fails with an EPERM error. This problem is described in [this > systemd issue|https://github.com/systemd/systemd/pull/9483] and [this > lxd issue|https://github.com/lxc/lxd/issues/4950]. > The problem arises in the Mesos containerizer due to the order of operations: > # Clone the containerizer with {{CLONE_NEWNS}} > # Mount a tmpfs for the devices > # mknod for the various device nodes > Referring back to the lxd issue, because we do (1) before (2), the tmpfs on > {{/dev}} is marked {{SB_I_NODEV}}. Due to the new 4.18 behavior, the mknod in > (3) now succeeds (see commit > [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]). > Previously it would fail and we would fall back to bind mounting the device. > However, even though we created the device, we can't actually open it due to > the {{SB_I_NODEV}} flag on the tmpfs mount. It appears that the purpose of > allowing mknod is so that containers can create overlayfs whiteouts. 
> One approach to deal with this in the Mesos containerizer is to complete the > device node cleanup that was begun with the linux/devices isolator. This > approach involves moving all the responsibility for creating devices back to > the isolators. Then, at containerization time, we simply bind-mount the whole > of /dev from the per-container staging area. Since the isolators create the > devices in the host namespace and on the Mesos work directory, none of the > conditions that trigger the failure would be invoked. > The failure we observed with our tasks was a failure to open {{/dev/null}}, > when redirecting it as standard input to a child process. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9418) CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage fails on 4.19 kernels
[ https://issues.apache.org/jira/browse/MESOS-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-9418: -- Assignee: James Peach > CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage fails on 4.19 kernels > - > > Key: MESOS-9418 > URL: https://issues.apache.org/jira/browse/MESOS-9418 > Project: Mesos > Issue Type: Bug > Components: containerization, test >Reporter: James Peach >Assignee: James Peach >Priority: Major > > The {{CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage}} test fails on Linux 4.19 > kernels. > {noformat} > [jpeach@jpeach mesos]$ uname -r > 4.19.3-300.fc29.x86_64 > [jpeach@jpeach build]$ sudo env GLOG_v=1 ./src/mesos-tests --verbose > --gtest_filter=CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage > ... > W1126 10:45:44.941278 30021 cgroups.cpp:895] Skipping resource statistic for > container 8f67e5f9-ebf0-436c-a1d2-f30c69883a27 because: Failed to parse blkio > value '8:0 Discard 0' from 'blkio.io_service_bytes': Invalid major:minor > device number: 'Discard' > ../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1890: Failure > Value of: usage->has_blkio_statistics() > Actual: false > Expected: true > ../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1891: Failure > Expected: (2) <= (usage->blkio_statistics().throttling_size()), actual: 2 vs 0 > ../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1902: Failure > totalThrottling is NONE > mesos-tests: ../../../3rdparty/stout/include/stout/option.hpp:119: T > ::get() & [T = > mesos::CgroupInfo_Blkio_Throttling_Statistics]: Assertion `isSome()' failed. > ... > {noformat} > The actual cgroup format is: > {noformat} > [jpeach@jpeach blkio]$ pwd > /sys/fs/cgroup/blkio > [jpeach@jpeach blkio]$ cat > mesos_test_e9c8e0aa-3172-4d8d-b216-c8f5286a7efc/blkio.io_service_bytes > 8:0 Read 0 > 8:0 Write 0 > 8:0 Sync 0 > 8:0 Async 0 > 8:0 Discard 0 > 8:0 Total 0 > Total 0 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9418) CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage fails on 4.19 kernels
James Peach created MESOS-9418: -- Summary: CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage fails on 4.19 kernels Key: MESOS-9418 URL: https://issues.apache.org/jira/browse/MESOS-9418 Project: Mesos Issue Type: Bug Components: containerization, test Reporter: James Peach The {{CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage}} test fails on Linux 4.19 kernels. {noformat} [jpeach@jpeach mesos]$ uname -r 4.19.3-300.fc29.x86_64 [jpeach@jpeach build]$ sudo env GLOG_v=1 ./src/mesos-tests --verbose --gtest_filter=CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage ... W1126 10:45:44.941278 30021 cgroups.cpp:895] Skipping resource statistic for container 8f67e5f9-ebf0-436c-a1d2-f30c69883a27 because: Failed to parse blkio value '8:0 Discard 0' from 'blkio.io_service_bytes': Invalid major:minor device number: 'Discard' ../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1890: Failure Value of: usage->has_blkio_statistics() Actual: false Expected: true ../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1891: Failure Expected: (2) <= (usage->blkio_statistics().throttling_size()), actual: 2 vs 0 ../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1902: Failure totalThrottling is NONE mesos-tests: ../../../3rdparty/stout/include/stout/option.hpp:119: T ::get() & [T = mesos::CgroupInfo_Blkio_Throttling_Statistics]: Assertion `isSome()' failed. ... {noformat} The actual cgroup format is: {noformat} [jpeach@jpeach blkio]$ pwd /sys/fs/cgroup/blkio [jpeach@jpeach blkio]$ cat mesos_test_e9c8e0aa-3172-4d8d-b216-c8f5286a7efc/blkio.io_service_bytes 8:0 Read 0 8:0 Write 0 8:0 Sync 0 8:0 Async 0 8:0 Discard 0 8:0 Total 0 Total 0 {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
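The {{blkio.io_service_bytes}} format shown in MESOS-9418 can be parsed without hard-coding the set of operation names, which is one way to tolerate new entries such as {{Discard}}. The Python below is a sketch of that idea and not the Mesos parser; `parse_blkio_stat` and its return shape are assumptions for this example. It accepts `major:minor <op> <value>` lines for any operation name, plus the trailing `Total <value>` line.

```python
def parse_blkio_stat(text):
    """Parse blkio.io_service_bytes-style output.

    Returns ({(major, minor): {operation: value}}, overall_total), where
    overall_total is None if no bare "Total <value>" line is present.
    """
    per_device = {}
    overall_total = None

    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue

        # The final "Total <value>" line carries no device number.
        if len(fields) == 2 and fields[0] == "Total":
            overall_total = int(fields[1])
            continue

        if len(fields) != 3:
            raise ValueError(f"unexpected blkio line: {line!r}")

        major, sep, minor = fields[0].partition(":")
        if not sep:
            raise ValueError(f"invalid major:minor device number: {fields[0]!r}")

        device = (int(major), int(minor))
        # Any operation name is accepted, so "Discard" needs no special case.
        per_device.setdefault(device, {})[fields[1]] = int(fields[2])

    return per_device, overall_total
```

Fed the cgroup contents quoted in the issue, this yields one device, `(8, 0)`, with six operations including `Discard`, and an overall total of 0.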
[jira] [Commented] (MESOS-9393) Fetcher crashes extracting archives with non-ASCII filenames.
[ https://issues.apache.org/jira/browse/MESOS-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688874#comment-16688874 ] James Peach commented on MESOS-9393: Probably need to ensure that we are building libarchive with {{\-\-with-iconv}}. > Fetcher crashes extracting archives with non-ASCII filenames. > - > > Key: MESOS-9393 > URL: https://issues.apache.org/jira/browse/MESOS-9393 > Project: Mesos > Issue Type: Bug > Components: fetcher >Reporter: James Peach >Priority: Critical > > {noformat} > (gdb) bt > #0 0x7f2ec3827925 in raise () from /lib64/libc.so.6 > #1 0x7f2ec3829105 in abort () from /lib64/libc.so.6 > #2 0x7f2ec3e5da5d in __gnu_cxx::__verbose_terminate_handler() () from > /usr/lib64/libstdc++.so.6 > #3 0x7f2ec3e5bbe6 in ?? () from /usr/lib64/libstdc++.so.6 > #4 0x7f2ec3e5bc13 in std::terminate() () from /usr/lib64/libstdc++.so.6 > #5 0x7f2ec3e5bd0e in __cxa_throw () from /usr/lib64/libstdc++.so.6 > #6 0x7f2ec3e00837 in std::__throw_logic_error(char const*) () from > /usr/lib64/libstdc++.so.6 > #7 0x7f2ec3e3be59 in ?? () from /usr/lib64/libstdc++.so.6 > #8 0x7f2ec3e3bf33 in std::basic_string, > std::allocator >::basic_string(char const*, std::allocator > const&) () >from /usr/lib64/libstdc++.so.6 > #9 0x555f5e843a6d in archiver::extract (source=..., > > destination="/tmp/mesos/slaves/04f97156-23b7-4411-8fa7-bdec71518221-S1320/frameworks/156b4459-4bb6-460b-89e5-d8c583dee257-0413/executors/cstapper-test-service.simple-pod.test.0.ti9dgngkdceq2_0/runs/4a2a188e-54ef-4"..., > flags=) at ../../3rdparty/stout/include/stout/archiver.hpp:130 > #10 0x555f5e859f06 in extract > (sourcePath="/tmp/mesos/fetch/siri/c3-ace-inspector.tar.gz", > > destinationDirectory="/tmp/mesos/slaves/04f97156-23b7-4411-8fa7-bdec71518221-S1320/frameworks/156b4459-4bb6-460b-89e5-d8c583dee257-0413/executors/cstapper-test-service.simple-pod.test.0.ti9dgngkdceq2_0/runs/4a2a188e-54ef-4"...) 
> at ../../src/launcher/fetcher.cpp:86 > {noformat} > {noformat} > (gdb) p (struct archive_string_conv > *)archive_string_conversion_to_charset(entry->archive, "UTF-8", 1) > $1 = (struct archive_string_conv *) 0x7fe599cd2be0 > (gdb) p >ae_pathname > $2 = (struct archive_mstring *) 0x7fe599c48010 > (gdb) p (int)archive_strncpy_l(&($2->aes_utf8), $2->aes_mbs.s, > $2->aes_mbs.length, $1) > $3 = -1 > {noformat} > So archive_strncpy_l() fails with -1. best_effort_strncat_in_locale() has > this wonky-looking code: > {noformat} > 2235 remaining = length; > 2236 itp = (const uint8_t *)_p; > 2237 while (*itp && remaining > 0) { > 2238 if (*itp > 127) { > 2239 // Non-ASCII: Substitute with suitable replacement > 2240 if (sc->flag & SCONV_TO_UTF8) { > 2241 if (archive_string_append(as, utf8_replacement_char, > sizeof(utf8_replacement_char)) == NULL) { > 2242 __archive_errx(1, "Out of memory"); > 2243 } > 2244 } else { > 2245 archive_strappend_char(as, '?'); > 2246 } > 2247 return_value = -1; > 2248 } else { > 2249 archive_strappend_char(as, *itp); > 2250 } > 2251 ++itp; > 2252 } > (gdb) break best_effort_strncat_in_locale > Breakpoint 2 at 0x56143c85ff70: file libarchive/archive_string.c, line 2213. > (gdb) p (int)archive_strncpy_l(&($2->aes_utf8), $2->aes_mbs.s, > $2->aes_mbs.length, $1) > ... > (gdb) > 2237 while (*itp && remaining > 0) { > (gdb) > 2238 if (*itp > 127) { > (gdb) > 2240 if (sc->flag & SCONV_TO_UTF8) { > (gdb) > 2241 if (archive_string_append(as, > utf8_replacement_char, sizeof(utf8_replacement_char)) == NULL) { > (gdb) > 2251 ++itp; > (gdb) > 2237 while (*itp && remaining > 0) { > (gdb) > 2247 return_value = -1; > (gdb) p *itp > $5 = 195 '\303' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9393) Fetcher crashes extracting archives with non-ASCII filenames.
James Peach created MESOS-9393: -- Summary: Fetcher crashes extracting archives with non-ASCII filenames. Key: MESOS-9393 URL: https://issues.apache.org/jira/browse/MESOS-9393 Project: Mesos Issue Type: Bug Components: fetcher Reporter: James Peach {noformat} (gdb) bt #0 0x7f2ec3827925 in raise () from /lib64/libc.so.6 #1 0x7f2ec3829105 in abort () from /lib64/libc.so.6 #2 0x7f2ec3e5da5d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6 #3 0x7f2ec3e5bbe6 in ?? () from /usr/lib64/libstdc++.so.6 #4 0x7f2ec3e5bc13 in std::terminate() () from /usr/lib64/libstdc++.so.6 #5 0x7f2ec3e5bd0e in __cxa_throw () from /usr/lib64/libstdc++.so.6 #6 0x7f2ec3e00837 in std::__throw_logic_error(char const*) () from /usr/lib64/libstdc++.so.6 #7 0x7f2ec3e3be59 in ?? () from /usr/lib64/libstdc++.so.6 #8 0x7f2ec3e3bf33 in std::basic_string, std::allocator >::basic_string(char const*, std::allocator const&) () from /usr/lib64/libstdc++.so.6 #9 0x555f5e843a6d in archiver::extract (source=..., destination="/tmp/mesos/slaves/04f97156-23b7-4411-8fa7-bdec71518221-S1320/frameworks/156b4459-4bb6-460b-89e5-d8c583dee257-0413/executors/cstapper-test-service.simple-pod.test.0.ti9dgngkdceq2_0/runs/4a2a188e-54ef-4"..., flags=) at ../../3rdparty/stout/include/stout/archiver.hpp:130 #10 0x555f5e859f06 in extract (sourcePath="/tmp/mesos/fetch/siri/c3-ace-inspector.tar.gz", destinationDirectory="/tmp/mesos/slaves/04f97156-23b7-4411-8fa7-bdec71518221-S1320/frameworks/156b4459-4bb6-460b-89e5-d8c583dee257-0413/executors/cstapper-test-service.simple-pod.test.0.ti9dgngkdceq2_0/runs/4a2a188e-54ef-4"...) 
at ../../src/launcher/fetcher.cpp:86 {noformat} {noformat} (gdb) p (struct archive_string_conv *)archive_string_conversion_to_charset(entry->archive, "UTF-8", 1) $1 = (struct archive_string_conv *) 0x7fe599cd2be0 (gdb) p >ae_pathname $2 = (struct archive_mstring *) 0x7fe599c48010 (gdb) p (int)archive_strncpy_l(&($2->aes_utf8), $2->aes_mbs.s, $2->aes_mbs.length, $1) $3 = -1 {noformat} So archive_strncpy_l() fails with -1. best_effort_strncat_in_locale() has this wonky-looking code: {noformat} 2235 remaining = length; 2236 itp = (const uint8_t *)_p; 2237 while (*itp && remaining > 0) { 2238 if (*itp > 127) { 2239 // Non-ASCII: Substitute with suitable replacement 2240 if (sc->flag & SCONV_TO_UTF8) { 2241 if (archive_string_append(as, utf8_replacement_char, sizeof(utf8_replacement_char)) == NULL) { 2242 __archive_errx(1, "Out of memory"); 2243 } 2244 } else { 2245 archive_strappend_char(as, '?'); 2246 } 2247 return_value = -1; 2248 } else { 2249 archive_strappend_char(as, *itp); 2250 } 2251 ++itp; 2252 } (gdb) break best_effort_strncat_in_locale Breakpoint 2 at 0x56143c85ff70: file libarchive/archive_string.c, line 2213. (gdb) p (int)archive_strncpy_l(&($2->aes_utf8), $2->aes_mbs.s, $2->aes_mbs.length, $1) ... (gdb) 2237 while (*itp && remaining > 0) { (gdb) 2238 if (*itp > 127) { (gdb) 2240 if (sc->flag & SCONV_TO_UTF8) { (gdb) 2241 if (archive_string_append(as, utf8_replacement_char, sizeof(utf8_replacement_char)) == NULL) { (gdb) 2251 ++itp; (gdb) 2237 while (*itp && remaining > 0) { (gdb) 2247 return_value = -1; (gdb) p *itp $5 = 195 '\303' {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
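The backtrace in MESOS-9393 ends in the std::string(const char*) constructor throwing std::logic_error, which is how libstdc++ reacts to a null pointer. The following is a minimal sketch of a defensive conversion, assuming (this is an inference from the backtrace, not something the ticket states) that the pathname accessor returned NULL after the failed charset conversion. `assignIfNotNull` is a hypothetical helper, not part of stout or libarchive.

```cpp
#include <string>

// Hypothetical helper (not a stout or libarchive API): copies a C string
// into `out` only when the pointer is non-null. Returns false for a null
// pointer instead of letting the std::string constructor throw.
bool assignIfNotNull(const char* s, std::string* out)
{
  if (s == nullptr || out == nullptr) {
    return false;
  }

  out->assign(s);
  return true;
}
```

A caller could then turn the `false` case into an ordinary extraction error rather than aborting the fetcher.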
[jira] [Created] (MESOS-9367) GetContainers call crashes when using XFS disk isolation.
James Peach created MESOS-9367: -- Summary: GetContainers call crashes when using XFS disk isolation. Key: MESOS-9367 URL: https://issues.apache.org/jira/browse/MESOS-9367 Project: Mesos Issue Type: Bug Components: agent Reporter: James Peach Assignee: James Peach Here's the check failure: {noformat} F1031 20:30:33.246723 3435208 evolve.cpp:736] Check failed: '::protobuf::parse(resource_statistics.get())' Must be SOME: Missing required fields: disk_statistics[0].source.type {noformat} The JSON that is being rendered into protobufs is: {noformat} "disk_statistics": [ { "limit_bytes": 41943040, "persistence": { "id": "7461819b-b0bf-42fc-aa9e-f9958c545523", "principal": "jarvis-principal" }, "source": {}, "used_bytes": 25006080 } ], {noformat} Note the empty "source" element, which triggers the protobuf conversion failure. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9319) Move root filesystem creation to the `filesystem/linux` isolator.
[ https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667915#comment-16667915 ] James Peach commented on MESOS-9319: Retitling, based on a slightly expanded scope from review feedback. Rather than just building /dev in the Linux filesystem isolator, we are going to build the whole root filesystem. | [r/69086|https://reviews.apache.org/r/69086] | Moved container root construction to the isolators. | | [r/69211|https://reviews.apache.org/r/69211] | Improved the code comments for `getContainerDevicesPath`. | | [r/69210|https://reviews.apache.org/r/69210] | Used the MS_SILENT mount flag to elide unwanted logging. | > Move root filesystem creation to the `filesystem/linux` isolator. > - > > Key: MESOS-9319 > URL: https://issues.apache.org/jira/browse/MESOS-9319 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > > When using a custom user namespace isolator, the task fails at launch because > opening devices fails with an EPERM error. This problem is described in [this > systemd issue|https://github.com/systemd/systemd/pull/9483] and [this > lxd|https://github.com/lxc/lxd/issues/4950] issue. > The problem arises in the Mesos containerizer due to the order of operations: > # Clone the containerizer with {{CLONE_NEWNS}} > # Mount a tmpfs for the devices > # mknod for the various device nodes > Referring back to the lxc issue, because we do (1) before (2), the tmpfs on > {{/dev}} is marked {{SB_I_NODEV}}. Due to the new 4.18 behavior, the mknod in > (3) now succeeds (see commit > [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]). > Previously it would fail and we would fall back to bind mounting the device. > However, even though we created the device, we can't actually open it due to > the {{SB_I_NODEV}} flag on the tmpfs mount. 
It appears that the purpose of > allowing mknod is so that containers can create overlayfs whiteouts. > One approach to deal with this in the Mesos containerizer is to complete the > device node cleanup that was begun with the linux/devices isolator. This > approach involves moving all the responsibility for creating devices back to > the isolators. Then, at containerization time, we simply bind-mount the whole > of /dev from the per-container staging area. Since the isolators create the > devices in the host namespace and on the Mesos work directory, none of the > conditions that trigger the failure would be invoked. > The failure we observed with our tasks was a failure to open {{/dev/null}}, > when redirecting it as standard input to a child process. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9361) CgroupsIsolatorTest.ROOT_CGROUPS_CreateRecursively always fails.
James Peach created MESOS-9361: -- Summary: CgroupsIsolatorTest.ROOT_CGROUPS_CreateRecursively always fails. Key: MESOS-9361 URL: https://issues.apache.org/jira/browse/MESOS-9361 Project: Mesos Issue Type: Bug Components: flaky, test Reporter: James Peach On Fedora 28: {noformat} [ RUN ] CgroupsIsolatorTest.ROOT_CGROUPS_CreateRecursively I1029 09:38:31.866564 31397 cgroups.cpp:2838] Freezing cgroup /sys/fs/cgroup/freezer/mesos_test_62e0c540-832e-4601-8658-7faa25c427ce I1029 09:38:31.867048 31398 cgroups.cpp:1229] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos_test_62e0c540-832e-4601-8658-7faa25c427ce after 359936ns I1029 09:38:31.869033 31397 cgroups.cpp:2856] Thawing cgroup /sys/fs/cgroup/freezer/mesos_test_62e0c540-832e-4601-8658-7faa25c427ce I1029 09:38:31.869357 31403 cgroups.cpp:1258] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos_test_62e0c540-832e-4601-8658-7faa25c427ce after 261888ns I1029 09:38:31.884752 31382 cluster.cpp:173] Creating default 'local' authorizer I1029 09:38:31.892966 31397 master.cpp:413] Master 0b04a175-fe62-41a1-a387-8d679d1d9609 (jpeach.scv.apple.com) started on 17.228.8.72:42153 I1029 09:38:31.892992 31397 master.cpp:416] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/mFB69h/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/mFB69h/master" --zk_session_timeout="10secs" I1029 09:38:31.893931 31397 master.cpp:465] Master only allowing authenticated frameworks to register I1029 09:38:31.893942 31397 master.cpp:471] Master only allowing authenticated agents to register I1029 09:38:31.893951 31397 master.cpp:477] Master only allowing authenticated HTTP frameworks to register I1029 09:38:31.893962 31397 credentials.hpp:37] Loading credentials for authentication from '/tmp/mFB69h/credentials' I1029 09:38:31.894204 31397 master.cpp:521] Using default 'crammd5' authenticator I1029 09:38:31.894359 31397 authenticator.cpp:520] Initializing server SASL I1029 09:38:31.898878 31397 auxprop.cpp:73] Initialized in-memory auxiliary property plugin I1029 09:38:31.898983 31397 http.cpp:1038] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I1029 09:38:31.899279 31397 http.cpp:1038] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I1029 09:38:31.899395 31397 http.cpp:1038] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I1029 09:38:31.899507 31397 master.cpp:602] Authorization enabled I1029 09:38:31.900339 31406 whitelist_watcher.cpp:77] No whitelist given I1029 09:38:31.900434 31400 hierarchical.cpp:175] Initialized hierarchical allocator process I1029 09:38:31.908254 31403 
master.cpp:2105] Elected as the leading master! I1029 09:38:31.908313 31403 master.cpp:1660] Recovering from registrar I1029 09:38:31.908717 31404 registrar.cpp:339] Recovering registrar I1029 09:38:31.910310 31400 registrar.cpp:383] Successfully fetched the registry (0B) in 1.547776ms I1029 09:38:31.910684 31400 registrar.cpp:487] Applied 1 operations in 150793ns; attempting to update the registry I1029 09:38:31.913811 31400 registrar.cpp:544] Successfully updated the registry in 2.979072ms I1029 09:38:31.914028 31400 registrar.cpp:416] Successfully recovered registrar I1029 09:38:31.914872 31398 master.cpp:1774] Recovered 0 agents from the registry (154B); allowing 10mins for agents to reregister I1029 09:38:31.914912 31406 hierarchical.cpp:215] Skipping recovery of hierarchical allocator: nothing to recover I1029 09:38:31.920753 31382
[jira] [Assigned] (MESOS-9354) Automatically remount read-only bind mounts.
[ https://issues.apache.org/jira/browse/MESOS-9354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-9354: -- Assignee: James Peach > Automatically remount read-only bind mounts. > > > Key: MESOS-9354 > URL: https://issues.apache.org/jira/browse/MESOS-9354 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Reporter: James Peach >Assignee: James Peach >Priority: Minor > > To make a bind mount read-only, you have to first make the bind mount, then > remount it with the read-only flag. This is a bit arcane, which is why > mount(8) does it automatically. We should also do it automatically in > {{fs::mount}} so that every caller doesn't have to carry special code to make > it work correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9354) Automatically remount read-only bind mounts.
[ https://issues.apache.org/jira/browse/MESOS-9354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662995#comment-16662995 ] James Peach commented on MESOS-9354: /cc [~jieyu] > Automatically remount read-only bind mounts. > > > Key: MESOS-9354 > URL: https://issues.apache.org/jira/browse/MESOS-9354 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Reporter: James Peach >Priority: Minor > > To make a bind mount read-only, you have to first make the bind mount, then > remount it with the read-only flag. This is a bit arcane, which is why > mount(8) does it automatically. We should also do it automatically in > {{fs::mount}} so that every caller doesn't have to carry special code to make > it work correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9354) Automatically remount read-only bind mounts.
James Peach created MESOS-9354: -- Summary: Automatically remount read-only bind mounts. Key: MESOS-9354 URL: https://issues.apache.org/jira/browse/MESOS-9354 Project: Mesos Issue Type: Bug Components: agent, containerization Reporter: James Peach To make a bind mount read-only, you have to first make the bind mount, then remount it with the read-only flag. This is a bit arcane, which is why mount(8) does it automatically. We should also do it automatically in {{fs::mount}} so that every caller doesn't have to carry special code to make it work correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
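The two-step sequence described in MESOS-9354 can be sketched directly with mount(2). `bindMountReadOnly` is a hypothetical helper written for illustration; it is not the `fs::mount` signature, and it needs CAP_SYS_ADMIN to succeed.

```cpp
#include <sys/mount.h>

#include <cerrno>
#include <string>

// Hypothetical helper (not the Mesos fs::mount API): bind-mounts `source`
// onto `target` and then remounts the result read-only.
// Returns 0 on success, or the errno value of the step that failed.
int bindMountReadOnly(const std::string& source, const std::string& target)
{
  // Step 1: create the bind mount. A read-only flag passed here would
  // not take effect, which is why the second step exists.
  if (::mount(source.c_str(), target.c_str(), nullptr, MS_BIND, nullptr) != 0) {
    return errno;
  }

  // Step 2: remount the new bind mount with the read-only flag.
  if (::mount(nullptr, target.c_str(), nullptr,
              MS_REMOUNT | MS_BIND | MS_RDONLY, nullptr) != 0) {
    const int error = errno;
    ::umount(target.c_str()); // Best-effort cleanup of the writable mount.
    return error;
  }

  return 0;
}
```

Folding both steps into one call is the design point of the ticket: callers ask for a read-only bind mount once and cannot forget the remount.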
[jira] [Commented] (MESOS-9349) Prevent ptracing of container management processes.
[ https://issues.apache.org/jira/browse/MESOS-9349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661330#comment-16661330 ] James Peach commented on MESOS-9349: The plan here is to add an agent flag for operator visibility (probably the default should be enabled, so we improve security by default). We can examine the flag in the linux launcher, but from then on we can just sample and propagate the current setting. > Prevent ptracing of container management processes. > --- > > Key: MESOS-9349 > URL: https://issues.apache.org/jira/browse/MESOS-9349 > Project: Mesos > Issue Type: Bug > Components: containerization, security >Reporter: James Peach >Priority: Major > > The container launcher and the built-in executors are (at least partially) > accessible to containerized user tasks. Since these processes may contain > secrets or hold privileged resources, we can increase the difficulty of > attacking them by preventing user tasks attaching to them with ptrace(2). > This amounts to calling `prctl(PR_SET_DUMPABLE, 0)`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9349) Prevent ptracing of container management processes.
James Peach created MESOS-9349: -- Summary: Prevent ptracing of container management processes. Key: MESOS-9349 URL: https://issues.apache.org/jira/browse/MESOS-9349 Project: Mesos Issue Type: Bug Components: containerization, security Reporter: James Peach The container launcher and the built-in executors are (at least partially) accessible to containerized user tasks. Since these processes may contain secrets or hold privileged resources, we can increase the difficulty of attacking them by preventing user tasks attaching to them with ptrace(2). This amounts to calling `prctl(PR_SET_DUMPABLE, 0)`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
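A minimal sketch of the call named in MESOS-9349, plus a way to sample the current setting as the follow-up comment suggests. The helper names `disablePtraceAttach` and `isDumpable` are hypothetical and not Mesos APIs; only prctl(2) with PR_SET_DUMPABLE and PR_GET_DUMPABLE is taken as given.

```cpp
#include <sys/prctl.h>

#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: marks the calling process as non-dumpable, which
// stops unprivileged processes from attaching to it with ptrace(2).
// Returns an empty string on success, or an error message on failure.
std::string disablePtraceAttach()
{
  if (::prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) != 0) {
    return std::string("prctl(PR_SET_DUMPABLE) failed: ") +
           std::strerror(errno);
  }

  return "";
}

// Hypothetical helper: samples the current setting so that it can be
// propagated to child processes.
bool isDumpable()
{
  return ::prctl(PR_GET_DUMPABLE, 0, 0, 0, 0) == 1;
}
```

The kernel can reset the dumpable flag across execve(2), so a launcher would likely need to reapply it in each management process rather than rely on inheritance.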
[jira] [Commented] (MESOS-9348) URL-encoded HDFS artifacts can't be fetched through the cache.
[ https://issues.apache.org/jira/browse/MESOS-9348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659865#comment-16659865 ] James Peach commented on MESOS-9348: One approach here is to URL-encode the output filename for the HDFS command. Experimentally, it looks like this is required, since the command errors out on unsafe characters: {noformat} # hdfs dfs -copyToLocal hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar $(pwd)/%255B-jpeach-].jar copyToLocal: unexpected URISyntaxException {noformat} > URL-encoded HDFS artifacts can't be fetched through the cache. > -- > > Key: MESOS-9348 > URL: https://issues.apache.org/jira/browse/MESOS-9348 > Project: Mesos > Issue Type: Bug > Components: fetcher >Reporter: James Peach >Priority: Major > > The {{hdfs dfs}} command always does a URI decode on the target output file. > This means that the output file gets stored in the fetcher cache under the > wrong filename and we can never retrieve it. 
> Here's an example of how the command behaves: > {noformat} > [/tmp]# hdfs dfs -copyToLocal > hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar > $(pwd)/%5B-jpeach-%5D.jar > [/tmp]# ls -l *jpeach* > -rw-r--r-- 1 root root 7285799 Oct 22 23:29 [-jpeach-].jar > {noformat} > Here's how this plays out in the fetcher: > {noformat} > W1022 23:22:13.649587 3186459 fetcher.cpp:395] Copying instead of extracting > resource from URI with 'extract' flag, because it does not seem to be an > archive: > hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar > cp: cannot stat `/srv/mesos/fetch/jarvis/c67-connector-_ASE%5D.jar': No such > file or directory > E1022 23:22:13.652987 3186459 fetcher.cpp:613] EXIT with status 1: Failed to > fetch > 'hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar': > cp failed with status: 256 > ... > # ls -latr /srv/mesos/fetch > ... > -rw-r--r-- 1 jarvis jarvis 7285799 Oct 22 23:22 c67-connector-_ASE].jar > {noformat} > The fetcher has downloaded the artifact into the cache, but can't copy it > into the sandbox because it was downloaded to the wrong filename. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9348) URL-encoded HDFS artifacts can't be fetched through the cache.
James Peach created MESOS-9348: -- Summary: URL-encoded HDFS artifacts can't be fetched through the cache. Key: MESOS-9348 URL: https://issues.apache.org/jira/browse/MESOS-9348 Project: Mesos Issue Type: Bug Components: fetcher Reporter: James Peach The {{hdfs dfs}} command always does a URI decode on the target output file. This means that the output file gets stored in the fetcher cache under the wrong filename and we can never retrieve it. Here's an example of how the command behaves: {noformat} [/tmp]# hdfs dfs -copyToLocal hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar $(pwd)/%5B-jpeach-%5D.jar [/tmp]# ls -l *jpeach* -rw-r--r-- 1 root root 7285799 Oct 22 23:29 [-jpeach-].jar {noformat} Here's how this plays out in the fetcher: {noformat} W1022 23:22:13.649587 3186459 fetcher.cpp:395] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar cp: cannot stat `/srv/mesos/fetch/jarvis/c67-connector-_ASE%5D.jar': No such file or directory E1022 23:22:13.652987 3186459 fetcher.cpp:613] EXIT with status 1: Failed to fetch 'hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar': cp failed with status: 256 ... # ls -latr /srv/mesos/fetch ... -rw-r--r-- 1 jarvis jarvis 7285799 Oct 22 23:22 c67-connector-_ASE].jar {noformat} The fetcher has downloaded the artifact into the cache, but can't copy it into the sandbox because it was downloaded to the wrong filename. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
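The filename mismatch in MESOS-9348 follows from ordinary percent-decoding. The sketch below is a generic decoder written for illustration, not the code that `hdfs dfs` actually runs; `percentDecode` is a hypothetical name.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes %XX escapes in `input`. Malformed or
// truncated escapes are copied through unchanged.
std::string percentDecode(const std::string& input)
{
  std::string output;
  output.reserve(input.size());

  for (std::size_t i = 0; i < input.size(); ++i) {
    const bool escape =
      input[i] == '%' &&
      i + 2 < input.size() &&
      std::isxdigit(static_cast<unsigned char>(input[i + 1])) &&
      std::isxdigit(static_cast<unsigned char>(input[i + 2]));

    if (escape) {
      // Convert the two hex digits after '%' into a single byte.
      output.push_back(
          static_cast<char>(std::stoi(input.substr(i + 1, 2), nullptr, 16)));
      i += 2;
    } else {
      output.push_back(input[i]);
    }
  }

  return output;
}
```

With this decoder, `%5B-jpeach-%5D.jar` becomes `[-jpeach-].jar`, matching the `ls` output in the ticket, while a doubly encoded `%255B` decodes only to `%5B`. That is the idea behind URL-encoding the output filename before handing it to the HDFS command.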
[jira] [Commented] (MESOS-9319) Create all container devices at isolation time.
[ https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652677#comment-16652677 ] James Peach commented on MESOS-9319: Prototype code looks promising. Currently, /dev is a tmpfs, but in this proposal it would be a bind mount to a real filesystem. I'm binding it in read-only to prevent disk quota escapes, which seems to work OK. > Create all container devices at isolation time. > --- > > Key: MESOS-9319 > URL: https://issues.apache.org/jira/browse/MESOS-9319 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > > When using a custom user namespace isolator, the task fails at launch because > opening devices fails with an EPERM error. This problem is described in [this > systemd issue|https://github.com/systemd/systemd/pull/9483] and [this > lxd|https://github.com/lxc/lxd/issues/4950] issue. > The problem arises in the Mesos containerizer due to the order of operations: > # Clone the containerizer with {{CLONE_NEWNS}} > # Mount a tmpfs for the devices > # mknod for the various device nodes > Referring back to the lxc issue, because we do (1) before (2), the tmpfs on > {{/dev}} is marked {{SB_I_NODEV}}. Due to the new 4.18 behavior, the mknod in > (3) now succeeds (see commit > [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]). > Previously it would fail and we would fall back to bind mounting the device. > However, even though we created the device, we can't actually open it due to > the {{SB_I_NODEV}} flag on the tmpfs mount. It appears that the purpose of > allowing mknod is so that containers can create overlayfs whiteouts. > One approach to deal with this in the Mesos containerizer is to complete the > device node cleanup that was begun with the linux/devices isolator. This > approach involves moving all the responsibility for creating devices back to > the isolators. 
Then, at containerization time, we simply bind-mount the whole > of /dev from the per-container staging area. Since the isolators create the > devices in the host namespace and on the Mesos work directory, none of the > conditions that trigger the failure would be invoked. > The failure we observed with our tasks was a failure to open {{/dev/null}}, > when redirecting it as standard input to a child process. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9302) Mesos fails to build on Fedora 28
[ https://issues.apache.org/jira/browse/MESOS-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651914#comment-16651914 ] James Peach commented on MESOS-9302: Upstream cares fix is [#209|https://github.com/c-ares/c-ares/pull/209] > Mesos fails to build on Fedora 28 > - > > Key: MESOS-9302 > URL: https://issues.apache.org/jira/browse/MESOS-9302 > Project: Mesos > Issue Type: Bug > Environment: gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5) > Fedora 28 >Reporter: Benno Evers >Priority: Major > Labels: build-failure > > Trying to compile a fresh Mesos checkout on a Fedora 28 system with the > following configuration flags: > {noformat} > ../configure --enable-debug --enable-optimize --disable-java --disable-python > --disable-libtool-wrappers --enable-ssl --enable-libevent --disable-werror > {noformat} > and the following compiler > {noformat} > [bev...@core1.hw.ca1 build]$ gcc --version > gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5) > Copyright (C) 2018 Free Software Foundation, Inc. > This is free software; see the source for copying conditions. There is NO > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. > {noformat} > fails the build due to two warnings (even though --disable-werror was passed): > {noformat} > make[4]: Entering directory '/home/bevers/mesos/build/3rdparty/grpc-1.10.0' > [C] Compiling third_party/cares/cares/ares_init.c > third_party/cares/cares/ares_init.c: In function ‘ares_dup’: > third_party/cares/cares/ares_init.c:301:17: error: argument to ‘sizeof’ in > ‘strncpy’ call is the same expression as the source; did you mean to use the > size of the destination? 
[-Werror=sizeof-pointer-memaccess] >sizeof(src->local_dev_name)); > ^ > third_party/cares/cares/ares_init.c: At top level: > cc1: error: unrecognized command line option ‘-Wno-invalid-source-encoding’ > [-Werror] > cc1: all warnings being treated as errors > make[4]: *** [Makefile:2635: > /home/bevers/mesos/build/3rdparty/grpc-1.10.0/objs/opt/third_party/cares/cares/ares_init.o] > Error 1 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-9302) Mesos fails to build on Fedora 28
[ https://issues.apache.org/jira/browse/MESOS-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651914#comment-16651914 ] James Peach edited comment on MESOS-9302 at 10/16/18 3:21 PM: -- Upstream c-ares fix is [#209|https://github.com/c-ares/c-ares/pull/209] was (Author: jamespeach): Upstream cares fix is [#209|https://github.com/c-ares/c-ares/pull/209] > Mesos fails to build on Fedora 28 > - > > Key: MESOS-9302 > URL: https://issues.apache.org/jira/browse/MESOS-9302 > Project: Mesos > Issue Type: Bug > Environment: gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5) > Fedora 28 >Reporter: Benno Evers >Priority: Major > Labels: build-failure > > Trying to compile a fresh Mesos checkout on a Fedora 28 system with the > following configuration flags: > {noformat} > ../configure --enable-debug --enable-optimize --disable-java --disable-python > --disable-libtool-wrappers --enable-ssl --enable-libevent --disable-werror > {noformat} > and the following compiler > {noformat} > [bev...@core1.hw.ca1 build]$ gcc --version > gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5) > Copyright (C) 2018 Free Software Foundation, Inc. > This is free software; see the source for copying conditions. There is NO > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. > {noformat} > fails the build due to two warnings (even though --disable-werror was passed): > {noformat} > make[4]: Entering directory '/home/bevers/mesos/build/3rdparty/grpc-1.10.0' > [C] Compiling third_party/cares/cares/ares_init.c > third_party/cares/cares/ares_init.c: In function ‘ares_dup’: > third_party/cares/cares/ares_init.c:301:17: error: argument to ‘sizeof’ in > ‘strncpy’ call is the same expression as the source; did you mean to use the > size of the destination? 
[-Werror=sizeof-pointer-memaccess] >sizeof(src->local_dev_name)); > ^ > third_party/cares/cares/ares_init.c: At top level: > cc1: error: unrecognized command line option ‘-Wno-invalid-source-encoding’ > [-Werror] > cc1: all warnings being treated as errors > make[4]: *** [Makefile:2635: > /home/bevers/mesos/build/3rdparty/grpc-1.10.0/objs/opt/third_party/cares/cares/ares_init.o] > Error 1 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9319) Create all container devices at isolation time.
[ https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-9319: -- Assignee: James Peach > Create all container devices at isolation time. > --- > > Key: MESOS-9319 > URL: https://issues.apache.org/jira/browse/MESOS-9319 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > > When using a custom user namespace isolator, the task fails at launch because > opening devices fails with an EPERM error. This problem is described in [this > systemd issue|https://github.com/systemd/systemd/pull/9483] and [this > lxd|https://github.com/lxc/lxd/issues/4950] issue. > The problem arises in the Mesos containerizer due to the order of operations: > # Clone the containerizer with {{CLONE_NEWNS}} > # Mount a tmpfs for the devices > # mknod for the various device nodes > Referring back to the lxc issue, because we do (1) before (2), the tmpfs on > {{/dev}} is marked {{SB_I_NODEV}}. Due to the new 4.18 behavior, the mknod in > (3) now succeeds (see commit > [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]). > Previously it would fail and we would fall back to bind mounting the device. > However, even though we created the device, we can't actually open it due to > the {{SB_I_NODEV}} flag on the tmpfs mount. It appears that the purpose of > allowing mknod is so that containers can create overlayfs whiteouts. > One approach to deal with this in the Mesos containerizer is to complete the > device node cleanup that was begun with the linux/devices isolator. This > approach involves moving all the responsibility for creating devices back to > the isolators. Then, at containerization time, we simply bind-mount the whole > of /dev from the per-container staging area. 
Since the isolators create the > devices in the host namespace and on the Mesos work directory, none of the > conditions that trigger the failure would be invoked. > The failure we observed with our tasks was a failure to open {{/dev/null}}, > when redirecting it as standard input to a child process. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (MESOS-9319) Create all container devices at isolation time
[ https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-9319: --- Comment: was deleted (was: When using a custom user namespace isolator, the task fails at launch because opening devices fails with an {{EPERM}} error. This problem is described in [this systemd issue|https://github.com/systemd/systemd/pull/9483] and this [lxd issue|https://github.com/lxc/lxd/issues/4950]. The problem arises in the Mesos containerizer due to the order of operations: # Clone the containerizer with CLONE_NEWNS # Mount a tmpfs for the devices # mknod for the various device nodes Referring back to the lxc issue, because we do (1) before (2), the tmpfs on /dev is marked SB_I_NODEV. Due to the new 4.18 behavior, the mknod in (3) now succeeds (see commit [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]). Previously it would fail and we would fall back to bind mounting the device. However, even though we created the device, we can't actually open it due to the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of allowing mknod is so that containers can create overlayfs whiteouts. One approach to deal with this in the Mesos containerizer is to complete the device node cleanup that was begun with the linux/devices isolator. This approach involves moving all the responsibility for creating devices back to the isolators. Then, at containerization time, we simply bind-mount the whole of /dev from the per-container staging area. Since the isolators create the devices in the host namespace and on the Mesos work directory, none of the conditions that trigger the failure would be invoked. The failure we observed with our tasks was a failure to open {{/dev/null}}, when redirecting it as standard input to a child process.) 
> Create all container devices at isolation time > -- > > Key: MESOS-9319 > URL: https://issues.apache.org/jira/browse/MESOS-9319 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: James Peach >Priority: Major > > When using a custom user namespace isolator, the task fails at launch because > opening devices fails with an EPERM error. This problem is described in this > systemd issue and this lxd issue. > The problem arises in the Mesos containerizer due to the order of operations: > 1. Clone the containerizer with CLONE_NEWNS > 2. Mount a tmpfs for the devices > 3. mknod for the various device nodes > Referring back to the lxd issue, because we do (1) before (2), the tmpfs on > /dev is marked SB_I_NODEV. Due to the new Linux 4.18 behavior, the mknod in (3) now > succeeds (see commit 55956b59df33). Previously it would fail and we would > fall back to bind mounting the device. However, even though we created the > device, we can't actually open it due to the SB_I_NODEV flag on the tmpfs > mount. It appears that the purpose of allowing mknod is so that containers > can create overlayfs whiteouts. > One approach to deal with this in the Mesos containerizer is to complete the > device node cleanup that was begun with the linux/devices isolator. This > approach involves moving all the responsibility for creating devices back to > the isolators. Then, at containerization time, we simply bind-mount the whole > of /dev from the per-container staging area. Since the isolators create the > devices in the host namespace and on the Mesos work directory, none of the > conditions that trigger the failure would be invoked. > The failure we observed with our tasks was a failure to open /dev/null, when > redirecting it as standard input to a child process. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-9319) Create all container devices at isolation time
[ https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650806#comment-16650806 ] James Peach edited comment on MESOS-9319 at 10/15/18 9:18 PM: -- When using a custom user namespace isolator, the task fails at launch because opening devices fails with a {{EPERM}} error. This problem is described in [this system issue|https://github.com/systemd/systemd/pull/9483] and this [lxd issue|https://github.com/lxc/lxd/issues/4950]. The problem arises in the Mesos containerizer due to the order of operations: # Clone the containerizer with CLONE_NEWNS # Mount a tmpfs for the devices # mknod for the various device nodes Referring back to the lxc issue, because we do (1) before (2), the tmpfs on /dev is marked SB_I_NODEV. Due to the new 4.18 behavior, the mkdir in (3) now succeeds (see commit [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]). Previously it would fail and we would fall back to bind mounting the device. However, even though we created the device, we can't actually open it due to the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of allowing mknod is to that containers can create overlayfs whiteouts. One approach to deal with this in the Mesos containerizer is to complete the device node cleanup that was begun in with the linux/devices isolator. This approach involves moving all the responsibility for creating devices back to the isolators. Then, at containerization time, we simply bind-mount the whole of /dev from the per-container staging area. Since the isolators create the devices in the host namespace and on the Mesos work directory, none of the conditions that trigger the failure would be invoked. The failure we observed with our tasks was a failure to open {{/dev/null}}, when redirecting it as standard input to a child process. 
was (Author: jamespeach): When using a custom user namespace isolator, the task fails at launch because opening devices fails with a {{EPERM}} error. This problem is described in [this system issue|https://github.com/systemd/systemd/pull/9483] and this [lxd issue|https://github.com/lxc/lxd/issues/4950]. The problem arises in the Mesos containerizer due to the order of operations: # Clone the containerizer with CLONE_NEWNS # Mount a tmpfs for the devices # mknod for the various device nodes Referring back to the lxc issue, because we do (1) before (2), the tmpfs on /dev is marked SB_I_NODEV. Due to the new 4.18 behavior, the mkdir in (3) now succeeds (see commit [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]). Previously it would fail and we would fall back to bind mounting the device. However, even though we created the device, we can't actually open it due to the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of allowing mknod is to that containers can create overlayfs whiteouts. One approach to deal with this in the Mesos containerizer is to complete the device node cleanup that was begun in with the linux/devices isolator. This approach involves moving all the responsibility for creating devices back to the isolators. Then, at containerization time, we simply bind-mount the whole of /dev from the per-container staging area. Since the isolators create the devices in the host namespace and on the Mesos work directory, none of the conditions that trigger the failure would be invoked. > Create all container devices at isolation time > -- > > Key: MESOS-9319 > URL: https://issues.apache.org/jira/browse/MESOS-9319 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: James Peach >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9319) Create all container devices at isolation time
[ https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650806#comment-16650806 ] James Peach commented on MESOS-9319: When using a custom user namespace isolator, the task fails at launch because opening devices fails with an {{EPERM}} error. This problem is described in [this systemd issue|https://github.com/systemd/systemd/pull/9483] and this [lxd issue|https://github.com/lxc/lxd/issues/4950]. The problem arises in the Mesos containerizer due to the order of operations: # Clone the containerizer with CLONE_NEWNS # Mount a tmpfs for the devices # mknod for the various device nodes Referring back to the lxd issue, because we do (1) before (2), the tmpfs on /dev is marked SB_I_NODEV. Due to the new Linux 4.18 behavior, the mknod in (3) now succeeds (see commit [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]). Previously it would fail and we would fall back to bind mounting the device. However, even though we created the device, we can't actually open it due to the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of allowing mknod is so that containers can create overlayfs whiteouts. One approach to deal with this in the Mesos containerizer is to complete the device node cleanup that was begun with the linux/devices isolator. This approach involves moving all the responsibility for creating devices back to the isolators. Then, at containerization time, we simply bind-mount the whole of /dev from the per-container staging area. Since the isolators create the devices in the host namespace and on the Mesos work directory, none of the conditions that trigger the failure would be invoked. 
> Create all container devices at isolation time > -- > > Key: MESOS-9319 > URL: https://issues.apache.org/jira/browse/MESOS-9319 > Project: Mesos > Issue Type: Bug > Components: containerization > Environment: When using a custom user namespace isolator, the task > fails at launch because opening devices fails with a {{EPERM}} error. This > problem is described in [this system > issue|https://github.com/systemd/systemd/pull/9483] and this [lxd > issue|https://github.com/lxc/lxd/issues/4950]. > The problem arises in the Mesos containerizer due to the order of operations: > # Clone the containerizer with CLONE_NEWNS > # Mount a tmpfs for the devices > # mknod for the various device nodes > Referring back to the lxc issue, because we do (1) before (2), the tmpfs on > /dev is marked SB_I_NODEV. Due to the new 4.18 behavior, the mkdir in (3) now > succeeds (see commit > [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]). > Previously it would fail and we would fall back to bind mounting the device. > However, even though we created the device, we can't actually open it due to > the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of > allowing mknod is to that containers can create overlayfs whiteouts. > One approach to deal with this in the Mesos containerizer is to complete the > device node cleanup that was begun in with the linux/devices isolator. This > approach involves moving all the responsibility for creating devices back to > the isolators. Then, at containerization time, we simply bind-mount the whole > of /dev from the per-container staging area. Since the isolators create the > devices in the host namespace and on the Mesos work directory, none of the > conditions that trigger the failure would be invoked. >Reporter: James Peach >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9319) Create all container devices at isolation time
James Peach created MESOS-9319: -- Summary: Create all container devices at isolation time Key: MESOS-9319 URL: https://issues.apache.org/jira/browse/MESOS-9319 Project: Mesos Issue Type: Bug Components: containerization Environment: When using a custom user namespace isolator, the task fails at launch because opening devices fails with an {{EPERM}} error. This problem is described in [this systemd issue|https://github.com/systemd/systemd/pull/9483] and this [lxd issue|https://github.com/lxc/lxd/issues/4950]. The problem arises in the Mesos containerizer due to the order of operations: # Clone the containerizer with CLONE_NEWNS # Mount a tmpfs for the devices # mknod for the various device nodes Referring back to the lxd issue, because we do (1) before (2), the tmpfs on /dev is marked SB_I_NODEV. Due to the new Linux 4.18 behavior, the mknod in (3) now succeeds (see commit [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]). Previously it would fail and we would fall back to bind mounting the device. However, even though we created the device, we can't actually open it due to the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of allowing mknod is so that containers can create overlayfs whiteouts. One approach to deal with this in the Mesos containerizer is to complete the device node cleanup that was begun with the linux/devices isolator. This approach involves moving all the responsibility for creating devices back to the isolators. Then, at containerization time, we simply bind-mount the whole of /dev from the per-container staging area. Since the isolators create the devices in the host namespace and on the Mesos work directory, none of the conditions that trigger the failure would be invoked. Reporter: James Peach -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-8313) Provide a host namespace container supervisor.
[ https://issues.apache.org/jira/browse/MESOS-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358625#comment-16358625 ] James Peach edited comment on MESOS-8313 at 10/15/18 6:38 PM: -- Note, this supervisor need to reap all its children, as per MESOS-5893. was (Author: jamespeach): Note, this supervisor need to read all its children, as per MESOS-5893. > Provide a host namespace container supervisor. > -- > > Key: MESOS-8313 > URL: https://issues.apache.org/jira/browse/MESOS-8313 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Major > Attachments: IMG_2629.JPG > > > After more investigation on user namespaces, the current implementation of > creating the container namespaces needs some adjustment before we can > implement user namespaces in a useable fashion. > The problems we need to address are: > 1. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the PID namespace > to mount {{procfs}}. Currently, this prevents containers joining the host PID > namespace. The workaround is to always create a new container PID namespace > (as a child of the user namespace) with the {{namespaces/pid}} isolator. > 2. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the network > namespace to mount {{sysfs}}. There's no general workaround for this since we > can't generally require containers to not join the host network namespace. > 3. The containerizer can't enter a user namespace after entering the > {{chroot}}. This restriction makes the existing order of containerizer > operations impossible to remain in the case where we want the executor to be > in a new user namespace that has no children (i.e. to protect the container > from a privileged task). 
> After some discussion with [~jieyu], we believe that we can solve most or all > of these issues by creating a new containerizer supervisor that runs fully > outside the container and is responsible for constructing the rootfs mount > namespace, launching the containerizer to enter the rest of the container, > and waiting on the entered process. > Since this new supervisor process is not running in the user namespace, it > will be able to construct the container rootfs in a new mount namespace > without user namespace restrictions. We can then clone a child to fully > create and enter container namespaces along with the prefabricated rootfs > mount namespace. > The only drawback to this approach is that the container's mount namespace > will be owned by the root user namespace rather than the container user > namespace. We are OK with this for now. > The plan here is to retain the existing {{mesos-containerizer launch}} > subcommand and add a new {{mesos-containerizer supervise}} subcommand, which > will be its parent process. This new subcommand will be used for the default > executor and custom executor code paths. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
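The comment on MESOS-8313 notes that the supervisor needs to reap all of its children. A minimal sketch of what that could look like follows, assuming a plain POSIX process that blocks in waitpid(2) until no children remain. The names reapAllChildren and spawnExiting are hypothetical and are not taken from the Mesos source.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <map>

// Hypothetical helper: block until every child of the calling process has
// exited, and return each child's wait status keyed by PID.
std::map<pid_t, int> reapAllChildren()
{
  std::map<pid_t, int> statuses;

  while (true) {
    int status = 0;
    pid_t pid = ::waitpid(-1, &status, 0);

    if (pid > 0) {
      statuses[pid] = status;
      continue;
    }

    if (errno == EINTR) {
      continue; // Interrupted by a signal; retry the wait.
    }

    break; // ECHILD (no children remain) or an unexpected error.
  }

  return statuses;
}

// Hypothetical helper for trying the sketch out: fork a child that exits
// immediately with the given code. Returns -1 if fork() fails.
pid_t spawnExiting(int code)
{
  pid_t pid = ::fork();
  if (pid == 0) {
    ::_exit(code);
  }
  return pid;
}
```

A real supervisor would more likely reap from a SIGCHLD-driven loop and forward the exit status of the main container process, but the invariant is the same: keep waiting until waitpid() reports that no children are left.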
[jira] [Commented] (MESOS-9300) XFS isolator can mislabel project IDs on persistence volumes.
[ https://issues.apache.org/jira/browse/MESOS-9300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642305#comment-16642305 ] James Peach commented on MESOS-9300: macOS has [ATTR_DIR_MOUNTSTATUS|https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/getattrlist.2.html#//apple_ref/doc/man/2/getattrlist], but AFAIK there's not a straightforward equivalent on Linux. However, it looks like we can detect this on Linux with the [EXDEV rename trick|http://blog.schmorp.de/2016-03-03-detecting-a-mount-point.html] > XFS isolator can mislabel project IDs on persistence volumes. > - > > Key: MESOS-9300 > URL: https://issues.apache.org/jira/browse/MESOS-9300 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: James Peach >Assignee: James Peach >Priority: Major > > What happens here is that we are erroneously applying the sandbox's project > ID to the persistent volume. > First, the filesystem/linux isolator bind mounts the persistent volume into > the sandbox: > {noformat} > I1003 06:49:21.907644 2812466 linux.cpp:593] Mounting > '/srv/mesos/work/volumes/roles/pie.mobius/21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f' > to > '/srv/mesos/work/slaves/909cff92-8e17-41bf-a251-9b5eb6186c35-S0/frameworks/363e6d80-8c38-46cf-815f-2fbf60a62628-0309/executors/mobius-mloop-1538549013_438156792-v2-shared-volume.pod1.writer-job.0.e93hs3uips2i9_1/runs/9e5770a7-9f78-46dc-9264-3e80be0e40cc/shared' > for persistent volume disk(allocated: pie.mobius)(reservations: > [(DYNAMIC,pie.mobius,jarvis-principal,\{podInstance: e93hs3uips2i9, pod: > pod1, service: > mobius-mloop-1538549013_438156792-v2-shared-volume})])[21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f:shared]:1 > of container 9e5770a7-9f78-46dc-9264-3e80be0e40cc > {noformat} > Next, the `disk/xfs` isolator assigns a project ID to the sandbox: > {noformat} > I1003 06:49:21.920197 2812452 disk.cpp:402] Assigned project 6806 to > 
'/srv/mesos/work/slaves/909cff92-8e17-41bf-a251-9b5eb6186c35-S0/frameworks/363e6d80-8c38-46cf-815f-2fbf60a62628-0309/executors/mobius-mloop-1538549013_438156792-v2-shared-volume.pod1.writer-job.0.e93hs3uips2i9_1/runs/9e5770a7-9f78-46dc-9264-3e80be0e40cc' > {noformat} > Note that when this happens, the isolator recursively applies the project ID > to the contents of the sandbox. It doesn't follow symlinks or cross devices > when it does this, but on Linux, a bind mount would not trigger either of > these conditions. > Finally, the `disk/xfs` isolator tries to assign a project ID to the > persistent volume as it is used by the task: > {noformat} > F1003 06:49:21.920577 2812452 disk.cpp:532] Check failed: > scheduledProjects.contains(projectId.get()) untracked project ID 6806 for > volume ID 21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f on > /srv/mesos/work/volumes/roles/pie.mobius/21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f > {noformat} > This check fails, because if the persistent volume has a project ID, we > expect that it had already been scheduled for reclamation. However, its > project ID is the one we assigned to the sandbox. We don't schedule the > sandbox for reclamation until cleanup, so (fortunately) the invariant check > triggers. > So, apart from triggering the CHECK, the root cause of this is that we are > altering the project ID of the persistent volume, which permanently > misattributes the corresponding quota. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
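To illustrate why a walk that "doesn't follow symlinks or cross devices" still descends into a bind-mounted persistent volume, here is a minimal sketch of such a walk using lstat(2) and an st_dev comparison. This is not the disk/xfs isolator's code; walkSameDevice is a hypothetical helper, and the sketch assumes the volume and the sandbox live on the same filesystem, in which case the bind mount reports the same st_dev as the sandbox.

```cpp
#include <dirent.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not the Mesos implementation: return every path under
// `root` that a "no symlinks, no device crossing" walk would visit.
std::vector<std::string> walkSameDevice(const std::string& root)
{
  std::vector<std::string> visited;

  struct stat rootStat;
  if (::lstat(root.c_str(), &rootStat) != 0) {
    return visited;
  }

  std::vector<std::string> pending{root};
  while (!pending.empty()) {
    std::string current = pending.back();
    pending.pop_back();

    // lstat() does not follow symlinks, so a symlink to a directory is
    // reported as a symlink and is never descended into.
    struct stat s;
    if (::lstat(current.c_str(), &s) != 0) {
      continue;
    }

    // Skip anything on a different device. A bind mount of a directory from
    // the same filesystem has the same st_dev, so it passes this check.
    if (s.st_dev != rootStat.st_dev) {
      continue;
    }

    visited.push_back(current);

    if (!S_ISDIR(s.st_mode)) {
      continue;
    }

    DIR* dir = ::opendir(current.c_str());
    if (dir == nullptr) {
      continue;
    }

    while (dirent* entry = ::readdir(dir)) {
      std::string name = entry->d_name;
      if (name != "." && name != "..") {
        pending.push_back(current + "/" + name);
      }
    }

    ::closedir(dir);
  }

  return visited;
}
```

Neither check can tell a same-filesystem bind mount apart from an ordinary subdirectory, which is why the comment above looks for a separate mount-point test.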
[jira] [Created] (MESOS-9300) XFS isolator can mislabel project IDs on persistence volumes.
James Peach created MESOS-9300: -- Summary: XFS isolator can mislabel project IDs on persistence volumes. Key: MESOS-9300 URL: https://issues.apache.org/jira/browse/MESOS-9300 Project: Mesos Issue Type: Bug Components: agent Reporter: James Peach Assignee: James Peach What happens here is that we are erroneously applying the sandbox's project ID to the persistent volume. First, the filesystem/linux isolator bind mounts the persistent volume into the sandbox: {noformat} I1003 06:49:21.907644 2812466 linux.cpp:593] Mounting '/srv/mesos/work/volumes/roles/pie.mobius/21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f' to '/srv/mesos/work/slaves/909cff92-8e17-41bf-a251-9b5eb6186c35-S0/frameworks/363e6d80-8c38-46cf-815f-2fbf60a62628-0309/executors/mobius-mloop-1538549013_438156792-v2-shared-volume.pod1.writer-job.0.e93hs3uips2i9_1/runs/9e5770a7-9f78-46dc-9264-3e80be0e40cc/shared' for persistent volume disk(allocated: pie.mobius)(reservations: [(DYNAMIC,pie.mobius,jarvis-principal,\{podInstance: e93hs3uips2i9, pod: pod1, service: mobius-mloop-1538549013_438156792-v2-shared-volume})])[21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f:shared]:1 of container 9e5770a7-9f78-46dc-9264-3e80be0e40cc {noformat} Next, the `disk/xfs` isolator assigns a project ID to the sandbox: {noformat} I1003 06:49:21.920197 2812452 disk.cpp:402] Assigned project 6806 to '/srv/mesos/work/slaves/909cff92-8e17-41bf-a251-9b5eb6186c35-S0/frameworks/363e6d80-8c38-46cf-815f-2fbf60a62628-0309/executors/mobius-mloop-1538549013_438156792-v2-shared-volume.pod1.writer-job.0.e93hs3uips2i9_1/runs/9e5770a7-9f78-46dc-9264-3e80be0e40cc' {noformat} Note, that when this happens, the isolator recursively applies the project ID to the contents of the sandbox. It doesn't follow symlinks or cross devices when it does this, but on Linux, a bind mount would not trigger either of these conditions. 
Finally, the `disk/xfs` isolator tries to assign a project ID to the persistent volume as it is used by the task: {noformat} F1003 06:49:21.920577 2812452 disk.cpp:532] Check failed: scheduledProjects.contains(projectId.get()) untracked project ID 6806 for volume ID 21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f on /srv/mesos/work/volumes/roles/pie.mobius/21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f {noformat} This check fails, because if the persistent volume has a project ID, we expect that it had already been scheduled for reclamation. However, its project ID is the one we assigned to the sandbox. We don't schedule the sandbox for reclamation until cleanup, so (fortunately) the invariant check triggers. So, apart from triggering the CHECK, the root cause of this is that we are altering the project ID of the persistent volume, which permanently misattributes the corresponding quota. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-895) Unbundle libev.
[ https://issues.apache.org/jira/browse/MESOS-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624201#comment-16624201 ] James Peach commented on MESOS-895: --- {noformat} commit 0b9861e356ec2d7d50163ae54a6be9c1c45f279b Author: James Peach Date: Fri Sep 21 14:13:29 2018 -0700 Removed bundled libev patch. Since we now disable the libev SIGCHLD handler at runtime, we no longer need to bundle the patch to do it at build time. It is still useful to bundle libev itself, to support older distributions. Review: https://reviews.apache.org/r/68800/ {noformat} > Unbundle libev. > --- > > Key: MESOS-895 > URL: https://issues.apache.org/jira/browse/MESOS-895 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: Timothy St. Clair >Assignee: James Peach >Priority: Major > Labels: tech-debt > > The libev patch can easily be removed and update the configuration flags and > possibly the accompanying code prior to include. > For configure pass in: > CFLAGS=-DEV_CHILD_ENABLE=0 > For inclusion: > #define EV_CHILD_ENABLE 0 > include > excerpt from maintainer: > that patch is unnecessary > schmorp, so if they wanted to just set EV_CHILD_ENABLE=0 they > could just pass CFLAGS=-DEV_CHILD_ENABLE=0 through. > tstclair: yes, or use a wrapper -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-895) Unbundle libev.
[ https://issues.apache.org/jira/browse/MESOS-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624201#comment-16624201 ] James Peach edited comment on MESOS-895 at 9/21/18 9:21 PM: {noformat} commit 0b9861e356ec2d7d50163ae54a6be9c1c45f279b Author: James Peach Date: Fri Sep 21 14:13:29 2018 -0700 Removed bundled libev patch. Since we now disable the libev SIGCHLD handler at runtime, we no longer need to bundle the patch to do it at build time. It is still useful to bundle libev itself, to support older distributions. Review: https://reviews.apache.org/r/68800/ {noformat} was (Author: jamespeach): {noformat} commit 0b9861e356ec2d7d50163ae54a6be9c1c45f279b Author: James Peach Date: Fri Sep 21 14:13:29 2018 -0700 Removed bundled libev patch. Since we now disable the libev SIGCHLD handler at runtime, we no longer need to bundle the patch to do it at build time. It is still useful to bundle libev itself, to support older distributions. Review: https://reviews.apache.org/r/68800/ {noformat} > Unbundle libev. > --- > > Key: MESOS-895 > URL: https://issues.apache.org/jira/browse/MESOS-895 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: Timothy St. Clair >Assignee: James Peach >Priority: Major > Labels: tech-debt > > The libev patch can easily be removed and update the configuration flags and > possibly the accompanying code prior to include. > For configure pass in: > CFLAGS=-DEV_CHILD_ENABLE=0 > For inclusion: > #define EV_CHILD_ENABLE 0 > include > excerpt from maintainer: > that patch is unnecessary > schmorp, so if they wanted to just set EV_CHILD_ENABLE=0 they > could just pass CFLAGS=-DEV_CHILD_ENABLE=0 through. > tstclair: yes, or use a wrapper -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9246) Verify libarchive version at configuration time.
[ https://issues.apache.org/jira/browse/MESOS-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622567#comment-16622567 ] James Peach commented on MESOS-9246: /cc [~andschwa] > Verify libarchive version at configuration time. > > > Key: MESOS-9246 > URL: https://issues.apache.org/jira/browse/MESOS-9246 > Project: Mesos > Issue Type: Bug >Reporter: James Peach >Priority: Major > > The Mesos build system doesn't verify that {{libarchive}} is a new enough > version to provide all the APIs that Mesos needs. For example, on CentOS 6 > with {{libarchive}} 2.8.3, the build will fail: > {noformat} > ../../3rdparty/stout/include/stout/archiver.hpp: In function 'Try > archiver::extract(const string&, const string&, int)': > ../../3rdparty/stout/include/stout/archiver.hpp:55:47: error: > 'archive_read_support_filter_all' was not declared in this scope >archive_read_support_filter_all(reader.get()); >^ > ../../3rdparty/stout/include/stout/archiver.hpp: In lambda function: > ../../3rdparty/stout/include/stout/archiver.hpp:61:27: error: > 'archive_write_free' was not declared in this scope >archive_write_free(p); >^ > ../../3rdparty/stout/include/stout/archiver.hpp: In function 'Try > archiver::extract(const string&, const string&, int)': > ../../3rdparty/stout/include/stout/archiver.hpp:120:70: error: > 'archive_entry_hardlink_utf8' was not declared in this scope >const char* hardlink_target = archive_entry_hardlink_utf8(entry); > ^ > ../../3rdparty/stout/include/stout/archiver.hpp:130:68: error: > 'archive_entry_pathname_utf8' was not declared in this scope >path::join(destination, > archive_entry_pathname_utf8(entry)).c_str()); > {noformat} > We should verify that new APIs we need are present at configuration time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9246) Verify libarchive version at configuration time.
James Peach created MESOS-9246: -- Summary: Verify libarchive version at configuration time. Key: MESOS-9246 URL: https://issues.apache.org/jira/browse/MESOS-9246 Project: Mesos Issue Type: Bug Reporter: James Peach The Mesos build system doesn't verify that {{libarchive}} is a new enough version to provide all the APIs that Mesos needs. For example, on CentOS 6 with {{libarchive}} 2.8.3, the build will fail: {noformat} ../../3rdparty/stout/include/stout/archiver.hpp: In function 'Try archiver::extract(const string&, const string&, int)': ../../3rdparty/stout/include/stout/archiver.hpp:55:47: error: 'archive_read_support_filter_all' was not declared in this scope archive_read_support_filter_all(reader.get()); ^ ../../3rdparty/stout/include/stout/archiver.hpp: In lambda function: ../../3rdparty/stout/include/stout/archiver.hpp:61:27: error: 'archive_write_free' was not declared in this scope archive_write_free(p); ^ ../../3rdparty/stout/include/stout/archiver.hpp: In function 'Try archiver::extract(const string&, const string&, int)': ../../3rdparty/stout/include/stout/archiver.hpp:120:70: error: 'archive_entry_hardlink_utf8' was not declared in this scope const char* hardlink_target = archive_entry_hardlink_utf8(entry); ^ ../../3rdparty/stout/include/stout/archiver.hpp:130:68: error: 'archive_entry_pathname_utf8' was not declared in this scope path::join(destination, archive_entry_pathname_utf8(entry)).c_str()); {noformat} We should verify that new APIs we need are present at configuration time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
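The issue above asks for a configuration-time check. As a complementary illustration only, the sketch below shows a compile-time version guard instead, which is a different technique from a configure probe. It relies on libarchive encoding ARCHIVE_VERSION_NUMBER as Mmmmrrr (major, three-digit minor, three-digit revision). The helper names and the 3.0.0 threshold are assumptions made for the example; the issue does not state the minimum version Mesos needs.

```cpp
#include <cassert>

// Decoded form of a libarchive-style version number (Mmmmrrr).
struct ArchiveVersion
{
  int majorVersion;
  int minorVersion;
  int revision;
};

// Hypothetical helper: split an Mmmmrrr number into its components.
constexpr ArchiveVersion decodeArchiveVersion(int number)
{
  return ArchiveVersion{number / 1000000, (number / 1000) % 1000, number % 1000};
}

// Hypothetical helper: true if `number` is at least major.minor.revision.
constexpr bool archiveAtLeast(int number, int major, int minor, int revision)
{
  return number >= major * 1000000 + minor * 1000 + revision;
}

// With the real header available, the guard would read roughly:
//
//   #include <archive.h>
//   static_assert(archiveAtLeast(ARCHIVE_VERSION_NUMBER, 3, 0, 0),
//                 "libarchive is too old");
//
// Here the CentOS 6 version from the report (2.8.3, i.e. 2008003) is used
// directly so that the example compiles without libarchive installed.
static_assert(!archiveAtLeast(2008003, 3, 0, 0),
              "2.8.3 is older than the assumed 3.0.0 minimum");
```

A guard like this turns the four "was not declared in this scope" errors into a single readable message, but it still fails at build time rather than at configuration time. Probing for the individual functions in the configure script remains the fix the issue asks for.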
[jira] [Created] (MESOS-9240) CSI protobuf build fails when dependency tracking is disabled.
James Peach created MESOS-9240: -- Summary: CSI protobuf build fails when dependency tracking is disabled. Key: MESOS-9240 URL: https://issues.apache.org/jira/browse/MESOS-9240 Project: Mesos Issue Type: Bug Components: build Reporter: James Peach Assignee: James Peach Generating the CSI protobufs depends on the "$(builddir)/include/csi" directory being created at configuration time. However, this only happens when automatic build dependency tracking is enabled. By default, rpmbuild will pass {{\--disable-dependency-tracking}}, which will prevent this directory from being created, and the build will fail like so: {noformat} ./../include/mesos/v1/master/master.proto /usr/bin/protoc -I../../include -I../../src -I../3rdparty/csi-0.2.0 --cpp_out=../include ../../include/mesos/v1/quota/quota.proto /usr/bin/protoc -I../../include -I../../src -I../3rdparty/csi-0.2.0 --cpp_out=../include ../../include/mesos/v1/resource_provider/resource_provider.proto ../include/csi/: No such file or directory /usr/bin/protoc -I../../include -I../../src -I../3rdparty/csi-0.2.0 --cpp_out=../include ../../include/mesos/v1/scheduler/scheduler.proto /usr/bin/protoc -I../../include -I../../src -I../3rdparty/csi-0.2.0 --cpp_out=. ../../src/master/registry.proto make[2]: *** [../include/csi/csi.grpc.pb.cc] Error 1 make[2]: *** Waiting for unfinished jobs ../include/csi/: No such file or directory make[2]: *** [../include/csi/csi.pb.cc] Error 1 {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-895) Unbundle libev.
[ https://issues.apache.org/jira/browse/MESOS-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-895: - Assignee: James Peach (was: Timothy St. Clair) CentOS 6 ships {{libev}} 4.03 and Ubuntu 14.04 ships 4.15, so once MESOS-9212 lands, I think we can unbundle {{libev}}. /cc [~tillt] [~bmahler] [~vinodkone] > Unbundle libev. > --- > > Key: MESOS-895 > URL: https://issues.apache.org/jira/browse/MESOS-895 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: Timothy St. Clair >Assignee: James Peach >Priority: Major > Labels: tech-debt > > The libev patch can easily be removed and update the configuration flags and > possibly the accompanying code prior to include. > For configure pass in: > CFLAGS=-DEV_CHILD_ENABLE=0 > For inclusion: > #define EV_CHILD_ENABLE 0 > include > excerpt from maintainer: > that patch is unnecessary > schmorp, so if they wanted to just set EV_CHILD_ENABLE=0 they > could just pass CFLAGS=-DEV_CHILD_ENABLE=0 through. > tstclair: yes, or use a wrapper -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9178) Add a metric for master failover time.
[ https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610795#comment-16610795 ] James Peach commented on MESOS-9178: Another way to measure this is to publish it in the event stream. > Add a metric for master failover time. > -- > > Key: MESOS-9178 > URL: https://issues.apache.org/jira/browse/MESOS-9178 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Xudong Ni >Assignee: Xudong Ni >Priority: Minor > > Quote from Yan Xu: Previously, the argument against it was that you don't know if > all agents are going to come back after a master failover, so there's not a > certain point that marks the end of "full reregistration of all agents". > However, empirically the number of agents usually doesn't change during the > failover, and there's an upper bound on such a wait (after a 10min timeout the > agents that haven't reregistered are going to be marked unreachable, so we can > just use that to stop the timer). > So we can define failover time as "the time it takes for all agents recovered > from the registry to be accounted for", i.e., either reregistered or marked as > unreachable. > This is of course looking at failover from an agent reregistration > perspective. > Later after we add framework info persistence, we can similarly define the > framework perspective using reregistration time or reconciliation time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
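The definition quoted in MESOS-9178 ("the time it takes for all agents recovered from the registry to be accounted for") can be sketched as a small timer that stops once every recovered agent has either reregistered or been marked unreachable. This is only an illustration of the definition, not the master's implementation. The class name FailoverTimer and its methods are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow.

```cpp
#include <cassert>
#include <chrono>
#include <set>
#include <string>
#include <utility>

// Hypothetical sketch: measures the time until every agent recovered from
// the registry has been accounted for.
class FailoverTimer
{
public:
  using Clock = std::chrono::steady_clock;

  FailoverTimer(std::set<std::string> recoveredAgents, Clock::time_point now)
    : pending(std::move(recoveredAgents)),
      startTime(now),
      endTime(now),
      finished(pending.empty()) {}

  void reregistered(const std::string& agentId, Clock::time_point now)
  {
    account(agentId, now);
  }

  void markedUnreachable(const std::string& agentId, Clock::time_point now)
  {
    account(agentId, now);
  }

  bool done() const { return finished; }

  // Only meaningful once done() returns true.
  Clock::duration elapsed() const { return endTime - startTime; }

private:
  void account(const std::string& agentId, Clock::time_point now)
  {
    if (finished) {
      return;
    }

    // Agents that were not in the recovered set (for example, brand new
    // agents) do not affect the measurement.
    pending.erase(agentId);

    if (pending.empty()) {
      finished = true;
      endTime = now;
    }
  }

  std::set<std::string> pending;
  Clock::time_point startTime;
  Clock::time_point endTime;
  bool finished;
};
```

Because unreachable agents stop the timer as well, the measured value is bounded by the reregistration timeout mentioned in the description (10 minutes there), whichever way each agent ends up being accounted for.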
[jira] [Commented] (MESOS-9178) Add a metric for master failover time.
[ https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609911#comment-16609911 ] James Peach commented on MESOS-9178: Say you have a time-series gauge at various percentages as per [~bmahler]'s suggestion. The gauge value would have to persist, so once it is set, it would remain at that value thereafter. If you needed to do analytics, you would need to carefully choose the first sample after a failover. For time-series, the easiest thing to do is to plot it, and it's not at all clear to me how you could do that and show a meaningful graph, because what you really want is to compare the historical failover times. I'm not that experienced with Grafana, but I don't know how I would do that. > Add a metric for master failover time. > -- > > Key: MESOS-9178 > URL: https://issues.apache.org/jira/browse/MESOS-9178 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Xudong Ni >Assignee: Xudong Ni >Priority: Minor > > Quote from Yan Xu: Previously, the argument against it was that you don't know if > all agents are going to come back after a master failover, so there's not a > certain point that marks the end of "full reregistration of all agents". > However, empirically the number of agents usually doesn't change during the > failover, and there's an upper bound on such a wait (after a 10min timeout the > agents that haven't reregistered are going to be marked unreachable, so we can > just use that to stop the timer). > So we can define failover time as "the time it takes for all agents recovered > from the registry to be accounted for", i.e., either reregistered or marked as > unreachable. > This is of course looking at failover from an agent reregistration > perspective. > Later after we add framework info persistence, we can similarly define the > framework perspective using reregistration time or reconciliation time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9178) Add a metric for master failover time.
[ https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16607548#comment-16607548 ]

James Peach commented on MESOS-9178:

Are we convinced that a metric is the right approach? This seems like something that you might want to compare over long time periods, which might be better suited to doing analytics on logs.
[jira] [Assigned] (MESOS-9212) Disable SIGCHILD handling in libev.
[ https://issues.apache.org/jira/browse/MESOS-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Peach reassigned MESOS-9212:

    Assignee: James Peach

| [r/68660|https://reviews.apache.org/r/68660] | Disabled SIGCHLD handling in the libev event loop. |
[jira] [Commented] (MESOS-9212) Subprocess tests fail with libev 4.24
[ https://issues.apache.org/jira/browse/MESOS-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16606233#comment-16606233 ]

James Peach commented on MESOS-9212:

This might be due to the libev patch we are carrying?
{noformat}
[jpeach@jpeach 3rdparty]$ cat libev-4.22.patch
diff --git a/ev.h b/ev.h
index 38f62d8..0055cfd 100644
--- a/ev.h
+++ b/ev.h
@@ -125,7 +125,7 @@ EV_CPP(extern "C" {)
 # ifdef _WIN32
 # define EV_CHILD_ENABLE 0
 # else
-# define EV_CHILD_ENABLE EV_FEATURE_WATCHERS
+# define EV_CHILD_ENABLE 0
 #endif
 #endif

[jpeach@jpeach 3rdparty]$ grep -r EV_CHILD_ENABLE /usr/include/
/usr/include/ev.h:#ifndef EV_CHILD_ENABLE
/usr/include/ev.h:# define EV_CHILD_ENABLE 0
/usr/include/ev.h:# define EV_CHILD_ENABLE EV_FEATURE_WATCHERS
/usr/include/ev.h:#if EV_CHILD_ENABLE && !EV_SIGNAL_ENABLE
/usr/include/ev.h:# if EV_CHILD_ENABLE
/usr/include/ev++.h: #if EV_CHILD_ENABLE
{noformat}
[jira] [Created] (MESOS-9212) Subprocess tests fail with libev 4.24
James Peach created MESOS-9212:

             Summary: Subprocess tests fail with libev 4.24
                 Key: MESOS-9212
                 URL: https://issues.apache.org/jira/browse/MESOS-9212
             Project: Mesos
          Issue Type: Bug
            Reporter: James Peach

On Fedora 28, building against the system version of libev (version 4.24) causes the following tests to fail:
{noformat}
[  FAILED  ] ReapTest.NonChildProcess
[  FAILED  ] ReapTest.ChildProcess
[  FAILED  ] ReapTest.TerminatedChildProcess
[  FAILED  ] SubprocessTest.PipeOutputToFileDescriptor
[  FAILED  ] SubprocessTest.PipeOutputToPath
[  FAILED  ] SubprocessTest.EnvironmentEcho
[  FAILED  ] SubprocessTest.Status
[  FAILED  ] SubprocessTest.PipeOutput
[  FAILED  ] SubprocessTest.PipeLargeOutput
[  FAILED  ] SubprocessTest.PipeInput
[  FAILED  ] SubprocessTest.PipeRedirect
[  FAILED  ] SubprocessTest.PathOutput
[  FAILED  ] SubprocessTest.PathInput
[  FAILED  ] SubprocessTest.FdOutput
[  FAILED  ] SubprocessTest.FdInput
[  FAILED  ] SubprocessTest.Default
[  FAILED  ] SubprocessTest.Flags
[  FAILED  ] SubprocessTest.Environment
[  FAILED  ] SubprocessTest.EnvironmentWithSpaces
[  FAILED  ] SubprocessTest.EnvironmentWithSpacesAndQuotes
[  FAILED  ] SubprocessTest.EnvironmentOverride
{noformat}

This build configuration succeeds:
{noformat}
$ ../configure --disable-java --disable-python --enable-silent-rules \
    --disable-hardening --disable-werror --disable-libtool-wrappers \
    --enable-xfs-disk-isolator --enable-install-module-dependencies \
    --enable-port-mapping-isolator --enable-network-ports-isolator \
    --with-protobuf=/usr --with-curl=/usr --with-libarchive=/usr \
    --with-zookeeper=/usr --prefix=/opt/mesos \
    "CXXFLAGS=-O0 -ggdb3 -fno-omit-frame-pointer -fvisibility-inlines-hidden -Wno-unused-local-typedefs -Wno-deprecated" \
    "CFLAGS=-O0 -ggdb3 -fno-omit-frame-pointer -Wno-unused-local-typedefs -Wno-deprecated" \
    LDFLAGS= \
    CXX=/home/jpeach/src/asf-mesos/build/c++ \
    CC=/home/jpeach/src/asf-mesos/build/cc \
    LD=/home/jpeach/src/asf-mesos/build/ld
{noformat}

This build configuration fails:
{noformat}
$ ../configure --disable-java --disable-python --enable-silent-rules \
    --disable-hardening --disable-werror --disable-libtool-wrappers \
    --enable-xfs-disk-isolator --enable-install-module-dependencies \
    --enable-port-mapping-isolator --enable-network-ports-isolator \
    --with-protobuf=/usr --with-curl=/usr --with-libarchive=/usr \
    --with-zookeeper=/usr --prefix=/opt/mesos \
    "CXXFLAGS=-O0 -ggdb3 -fno-omit-frame-pointer -fvisibility-inlines-hidden -Wno-unused-local-typedefs -Wno-deprecated" \
    "CFLAGS=-O0 -ggdb3 -fno-omit-frame-pointer -Wno-unused-local-typedefs -Wno-deprecated" \
    LDFLAGS= \
    CXX=/home/jpeach/src/asf-mesos/build/c++ \
    CC=/home/jpeach/src/asf-mesos/build/cc \
    LD=/home/jpeach/src/asf-mesos/build/ld \
    --with-libev=/usr
{noformat}

I think what happens here is that the child process gets reaped wrongly somehow:
{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from SubprocessTest
[ RUN      ] SubprocessTest.EnvironmentWithSpaces
[pid 25909] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa11881fcd0) = 25923
strace: Process 25923 attached
[pid 25923] execve("/usr/bin/sh", ["sh", "-c", "echo $MESSAGE"], 0x1ff3950 /* 1 var */) = 0
[pid 25923] arch_prctl(ARCH_SET_FS, 0x7f24561c5740) = 0
[pid 25923] exit_group(0) = ?
[pid 25923] +++ exited with 0 +++
[pid 25909] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=25923, si_uid=9306, si_status=0, si_utime=0, si_stime=0} ---
[pid 25922] wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED|WCONTINUED, NULL) = 25923
[pid 25922] wait4(-1, 0x7fa10a74da44, WNOHANG|WSTOPPED|WCONTINUED, NULL) = -1 ECHILD (No child processes)
[pid 25919] wait4(25923, 0x7fa10bf50548, WNOHANG, NULL) = -1 ECHILD (No child processes)
../../../3rdparty/libprocess/src/tests/subprocess_tests.cpp:977: Failure
(s->status()).get() is NONE
[  FAILED  ] SubprocessTest.EnvironmentWithSpaces (12 ms)
[----------] 1 test from SubprocessTest (12 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (12 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SubprocessTest.EnvironmentWithSpaces
{noformat}
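The strace output in this report shows one thread collecting the child with `wait4(-1, ...)` and another thread then getting `ECHILD` from `wait4(25923, ...)`. A minimal sketch of that general POSIX behavior follows; it is not libev or libprocess code, the function name is made up for illustration, and it assumes the calling process has no other children and has not set `SIGCHLD` to be ignored.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks a child that exits immediately, lets a catch-all waitpid(-1) collect
// it, and returns the errno that a later, targeted waitpid(child) observes.
// Returns -1 if the fork itself fails.
int errnoAfterCatchAllReap() {
  pid_t child = fork();
  if (child < 0) return -1;
  if (child == 0) _exit(0);  // Child: exit straight away.

  // Stand-in for an event loop that reaps "any child" on SIGCHLD.
  int status = 0;
  pid_t reaped;
  do {
    reaped = waitpid(-1, &status, 0);
  } while (reaped == -1 && errno == EINTR);
  assert(reaped == child);  // Assumption: no other children exist.

  // Stand-in for the owner polling for its specific child afterwards.
  errno = 0;
  pid_t result = waitpid(child, &status, WNOHANG);
  return result == -1 ? errno : 0;
}
```

Once the exit status has been consumed by the catch-all wait, the targeted wait has nothing left to collect, which matches the `status()` being NONE in the failing test.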
[jira] [Comment Edited] (MESOS-9172) Fetcher deadlock with duplicated URIs.
[ https://issues.apache.org/jira/browse/MESOS-9172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598998#comment-16598998 ]

James Peach edited comment on MESOS-9172 at 8/31/18 4:46 PM:

| [r/68587|https://reviews.apache.org/r/68587] | Fixed fetcher deadlock with duplicate URIs. |
| [r/68586|https://reviews.apache.org/r/68586] | Add the output file to the hash on CommandInfo::URI. |

was (Author: jamespeach):
| [r/68587|https://reviews.apache.org/*r/68587] | Fixed fetcher deadlock with duplicate URIs. |
| [r/68586|https://reviews.apache.org/*r/68586] | Add the output file to the hash on CommandInfo::URI. |
[jira] [Commented] (MESOS-9192) Mesos build fail on Ubuntu 14.04.
[ https://issues.apache.org/jira/browse/MESOS-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596866#comment-16596866 ]

James Peach commented on MESOS-9192:

Per [the docs|http://mesos.apache.org/documentation/latest/building/] we require clang >= 3.5. Maybe we ought to add a version check to the build like we did for GCC?

> Mesos build fail on Ubuntu 14.04.
> ---------------------------------
>
>                 Key: MESOS-9192
>                 URL: https://issues.apache.org/jira/browse/MESOS-9192
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Meng Zhu
>            Priority: Major
>
> Ubuntu 14.04, clang3.4
> If I manually install protobuf-compiler, the build will pass.
> {noformat}
> make[3]: Entering directory `/home/mengzhu/workspace/mesos_current/build/3rdparty'
> cd grpc-1.10.0 && \
>   CPPFLAGS="-I/home/mengzhu/workspace/mesos_current/build/3rdparty/protobuf-3.5.0/src \
>     \
>     \
>     -Wno-array-bounds \
>     -I/usr/include/subversion-1 -I/usr/include/apr-1 -I/usr/include/apr-1.0 " \
>   CFLAGS="-g1 -O0" \
>   CXXFLAGS="-g1 -O0 -Wno-inconsistent-missing-override -std=c++11" \
>   make \
>     /home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc++_unsecure.a \
>     /home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc_unsecure.a \
>     /home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0/libs/opt/libgpr.a \
>     CC="clang" \
>     CXX="clang++" \
>     LD="clang" \
>     LDXX="clang++" \
>     LDFLAGS="-L/home/mengzhu/workspace/mesos_current/build/3rdparty/protobuf-3.5.0/src/.libs \
>       \
>       \
>       " \
>     LDLIBS="" \
>     HAS_PKG_CONFIG=false \
>     NO_PROTOC=false \
>     PROTOC="/home/mengzhu/workspace/mesos_current/build/3rdparty/protobuf-3.5.0/src/protoc"
> make[4]: Entering directory `/home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0'
> DEPENDENCY ERROR
> The target you are trying to run requires protobuf 3.0.0+
> Your system doesn't have it, and neither does the third_party directory.
> Please consult INSTALL to get more information.
> If you need information about why these tests failed, run:
> make run_dep_checks
> make[4]: *** [stop] Error 1
> {noformat}
[jira] [Commented] (MESOS-9178) Add a metric for master failover time.
[ https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589159#comment-16589159 ]

James Peach commented on MESOS-9178:

/cc [~bmahler]
[jira] [Created] (MESOS-9175) `Subprocess::FD` can leak file descriptors into child processes
James Peach created MESOS-9175:

             Summary: `Subprocess::FD` can leak file descriptors into child processes
                 Key: MESOS-9175
                 URL: https://issues.apache.org/jira/browse/MESOS-9175
             Project: Mesos
          Issue Type: Bug
          Components: libprocess
            Reporter: James Peach

When you use the {{subprocess}} API, you can use {{Subprocess::FD()}} to define how the standard IO streams are attached to the child process. The default type argument is {{IO::DUPLICATED}}. In that case, the descriptors are duplicated with {{dup(2)}} in the parent process. The new file descriptors have their close-on-exec flag cleared, so they can then be inherited by unrelated child processes.
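The behavior described in this report can be seen directly with POSIX calls. The sketch below is not libprocess code and the helper names are made up for illustration; it contrasts `dup(2)`, which always clears close-on-exec on the new descriptor, with `fcntl(F_DUPFD_CLOEXEC)`, which sets it atomically. Whether the latter is the right fix for `Subprocess::FD` is left open here.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// True if `fd` currently has the close-on-exec flag set.
bool isCloexec(int fd) {
  assert(fd >= 0);
  int flags = fcntl(fd, F_GETFD);
  return flags != -1 && (flags & FD_CLOEXEC) != 0;
}

// Duplicates with dup(2): the new descriptor never has FD_CLOEXEC set,
// regardless of the flag on the original.
int duplicateInheritable(int fd) {
  return dup(fd);
}

// Duplicates with FD_CLOEXEC set on the new descriptor in one step, so no
// concurrently forked child can inherit it by accident.
int duplicateCloexec(int fd) {
  return fcntl(fd, F_DUPFD_CLOEXEC, 0);
}
```

A descriptor duplicated the second way has to have the flag cleared again in the intended child before `exec`, which keeps the window for accidental inheritance closed in the parent.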
[jira] [Created] (MESOS-9172) Fetcher deadlock with duplicated URIs.
James Peach created MESOS-9172:

             Summary: Fetcher deadlock with duplicated URIs.
                 Key: MESOS-9172
                 URL: https://issues.apache.org/jira/browse/MESOS-9172
             Project: Mesos
          Issue Type: Bug
          Components: fetcher
            Reporter: James Peach
            Assignee: James Peach

If the fetcher cache is empty and you launch a task that contains duplicate URIs, the fetcher deadlocks waiting for the futures in {{FetcherProcess::_fetch}}.

What happens is that when the fetcher is setting up the initial match of cache lookup futures in {{FetcherProcess::fetch}}, the duplicate URIs cause cache hits on the placeholder cache entries. This code assumes that there is already an operation in flight that will populate the cache entry. However, the cache is currently empty; the placeholder entry is caused by the duplicate in the task's URIs.

When we await the futures in {{FetcherProcess::_fetch}}, we end up waiting for the future that indicates the cache entry has been populated, but that won't ever happen because we need to make progress on the current fetching batch in order to populate the cache entry. At this point we are live-locked.
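The failure mode in this report can be modeled without any Mesos code. The sketch below is a toy model under stated assumptions: `selfBlockedLookups` and `dedupe` are hypothetical names, the "cache" is just a set of URIs that have a placeholder entry, and the de-duplication shown is one possible mitigation, not necessarily what the linked reviews implement.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model, not Mesos code. `reserved` holds URIs that already have a
// placeholder cache entry. Returns how many lookups in `batch` would wait on
// a placeholder created by the same batch, i.e. waits that can never finish
// if the batch only starts downloading after all of its lookups resolve.
std::size_t selfBlockedLookups(std::set<std::string>& reserved,
                               const std::vector<std::string>& batch) {
  std::set<std::string> reservedHere;
  std::size_t blocked = 0;
  for (const std::string& uri : batch) {
    if (reserved.insert(uri).second) {
      reservedHere.insert(uri);  // Miss: this batch owns the download.
    } else if (reservedHere.count(uri) > 0) {
      ++blocked;                 // Hit on a placeholder we created ourselves.
    }
    // Otherwise: hit on another batch's entry, which is a legitimate wait.
  }
  return blocked;
}

// One possible mitigation for the toy model: drop repeated URIs, keeping the
// first occurrence of each in its original position.
std::vector<std::string> dedupe(const std::vector<std::string>& batch) {
  std::set<std::string> seen;
  std::vector<std::string> unique;
  for (const std::string& uri : batch) {
    if (seen.insert(uri).second) unique.push_back(uri);
  }
  assert(unique.size() <= batch.size());
  return unique;
}
```

With an empty cache and a batch that lists the same URI twice, the model reports one self-blocked lookup; after de-duplication it reports none.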
[jira] [Assigned] (MESOS-9164) Subprocess should unset CLOEXEC on whitelisted file descriptors
[ https://issues.apache.org/jira/browse/MESOS-9164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Peach reassigned MESOS-9164:

    Assignee: James Peach
[jira] [Created] (MESOS-9164) Subprocess should unset CLOEXEC on whitelisted file descriptors
James Peach created MESOS-9164:

             Summary: Subprocess should unset CLOEXEC on whitelisted file descriptors
                 Key: MESOS-9164
                 URL: https://issues.apache.org/jira/browse/MESOS-9164
             Project: Mesos
          Issue Type: Bug
          Components: libprocess
            Reporter: James Peach

The libprocess subprocess API accepts a set of whitelisted file descriptors that are supposed to be inherited by the child process. On Windows, these are used, but otherwise the subprocess API just ignores them. We should probably make sure that the API clears the {{CLOEXEC}} flag on these descriptors so that they are inherited by the child.
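Clearing the flag as this report suggests comes down to a read-modify-write of the descriptor flags. The sketch below assumes a POSIX system; `clearCloexec` is a hypothetical helper, not the libprocess API, and where it would be called (for example in the child between fork and exec) is left open.

```cpp
#include <cassert>
#include <fcntl.h>
#include <vector>

// Clears FD_CLOEXEC on every descriptor in `whitelist` so that each one
// survives exec(). Returns false on the first fcntl() failure.
bool clearCloexec(const std::vector<int>& whitelist) {
  for (int fd : whitelist) {
    assert(fd >= 0);  // A negative descriptor is treated as a caller bug.
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1) return false;
    // Preserve any other descriptor flags and drop only close-on-exec.
    if (fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) == -1) return false;
  }
  return true;
}
```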
[jira] [Created] (MESOS-9161) Build bundled ZK with SOCK_CLOEXEC
James Peach created MESOS-9161: -- Summary: Build bundled ZK with SOCK_CLOEXEC Key: MESOS-9161 URL: https://issues.apache.org/jira/browse/MESOS-9161 Project: Mesos Issue Type: Bug Components: build Environment: We should enable {{\--with-sock-cloexec}} in our bundled ZooKeeper client build to enable the fix for ZOOKEEPER-2338 (which opens sockets with the {{SOCK_CLOEXEC}} flag). Reporter: James Peach -- This message was sent by Atlassian JIRA (v7.6.3#76005)
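For context on what the {{SOCK_CLOEXEC}} flag mentioned above does: it asks the kernel to create the socket with close-on-exec already set, so there is no window between `socket()` and a later `fcntl()` in which a forked child could inherit it. The sketch below is not ZooKeeper client code; the function name is made up, and `SOCK_CLOEXEC` is assumed to be available (Linux and several BSDs).

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/socket.h>

// Opens a stream socket in the given address family with close-on-exec set
// atomically at creation. Returns -1 on failure, as socket() does.
int openCloexecSocket(int domain) {
  int fd = socket(domain, SOCK_STREAM | SOCK_CLOEXEC, 0);
  // Sanity check: a successfully created socket carries FD_CLOEXEC.
  assert(fd == -1 || (fcntl(fd, F_GETFD) & FD_CLOEXEC) != 0);
  return fd;
}
```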
[jira] [Commented] (MESOS-5158) Provide XFS quota support for persistent volumes.
[ https://issues.apache.org/jira/browse/MESOS-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575181#comment-16575181 ] James Peach commented on MESOS-5158: Working on this now. > Provide XFS quota support for persistent volumes. > - > > Key: MESOS-5158 > URL: https://issues.apache.org/jira/browse/MESOS-5158 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Yan Xu >Assignee: James Peach >Priority: Major > > Given that the lifecycle of persistent volumes is managed outside of the > isolator, we may need to further abstract out the quota management > functionality to do it outside the XFS isolator. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9138) Crashes in ProcessTest.Process_BENCHMARK_DispatchDefer
James Peach created MESOS-9138: -- Summary: Crashes in ProcessTest.Process_BENCHMARK_DispatchDefer Key: MESOS-9138 URL: https://issues.apache.org/jira/browse/MESOS-9138 Project: Mesos Issue Type: Bug Components: libprocess Reporter: James Peach The `ProcessTest.Process_BENCHMARK_DispatchDefer` benchmark crashes fairly regularly (though not deterministically). {noformat} [ RUN ] ProcessTest.Process_BENCHMARK_DispatchDefer Movable elapsed: 12.65446863100secs ../../../3rdparty/libprocess/src/tests/benchmarks.cpp:572: Failure Failed to wait 15secs for promise.future() benchmarks: ../../../3rdparty/libprocess/include/process/dispatch.hpp:354: auto process::dispatch(const PID &, Future (DispatchProcess::*)(const DispatchProcess::Copyable &), const DispatchProcess::Copyable &&)::(anonymous class)::operator()(std::unique_ptr >, typename std::decay::type &&, process::ProcessBase *) const: Assertion `t != nullptr' failed. WARNING: Logging before InitGoogleLogging() is written to STDERR F0806 15:16:43.668474 28956 process.cpp:3419] Check failed: state.load() == ProcessBase::State::BOTTOM || state.load() == ProcessBase::State::TERMINATING *** Aborted at 1533593803 (unix time) try "date -d @1533593803" if you are using GNU date *** *** Check failure stack trace: *** PC: @ 0x7f24f4327feb __GI_raise *** SIGABRT (@0x245a711c) received by PID 28956 (TID 0x7f24eda65700) from PID 28956; stack trace: *** @ 0x7f24f540bfc0 (unknown) @ 0x7f24f4327feb __GI_raise @ 0x7f24f43125c1 __GI_abort @ 0x7f24f4312491 __assert_fail_base.cold.0 @ 0x7f24f4320752 __GI___assert_fail @ 0x4a8988 _ZZN7process8dispatchI7Nothing15DispatchProcessRKNS2_8CopyableES5_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSA_FS8_T1_EOT2_ENKUlSt10unique_ptrINS_7PromiseIS1_EESt14default_deleteISL_EEOS3_PNS_11ProcessBaseEE_clESO_SP_SR_ @ 0x4a879b 
_ZN5cpp176invokeIZN7process8dispatchI7Nothing15DispatchProcessRKNS4_8CopyableES7_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSC_FSA_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS3_EESt14default_deleteISN_EEOS5_PNS1_11ProcessBaseEE_JSQ_S5_ST_EEEDTclclsr3stdE7forwardIS9_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOS9_DpOSV_ @ 0x4a871b _ZN6lambda8internal7PartialIZN7process8dispatchI7Nothing15DispatchProcessRKNS5_8CopyableES8_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSD_FSB_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS4_EESt14default_deleteISO_EEOS6_PNS2_11ProcessBaseEE_JSR_S6_St12_PlaceholderILi113invoke_expandISV_St5tupleIJSR_S6_SX_EES10_IJOSU_EEJLm0ELm1ELm2DTclsr5cpp17E6invokeclsr3stdE7forwardISA_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardISD_Efp0_EEclsr3stdE7forwardISH_Efp2_OSA_OSD_N5cpp1416integer_sequenceImJXspT2_OSH_ @ 0x4a864e _ZNO6lambda8internal7PartialIZN7process8dispatchI7Nothing15DispatchProcessRKNS5_8CopyableES8_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSD_FSB_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS4_EESt14default_deleteISO_EEOS6_PNS2_11ProcessBaseEE_JSR_S6_St12_PlaceholderILi1clIJSU_EEEDTcl13invoke_expandclL_ZSt4moveIRSV_EONSt16remove_referenceISA_E4typeEOSA_EdtdefpT1fEclL_ZS10_IRSt5tupleIJSR_S6_SX_EEES15_S16_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOS1D_ @ 0x4a85e2 _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchI7Nothing15DispatchProcessRKNS7_8CopyableESA_EENS4_6FutureIT_EERKNS4_3PIDIT0_EEMSF_FSD_T1_EOT2_EUlSt10unique_ptrINS4_7PromiseIS6_EESt14default_deleteISQ_EEOS8_PNS4_11ProcessBaseEE_JST_S8_St12_PlaceholderILi1EJSW_EEEDTclclsr3stdE7forwardISC_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSC_DpOS11_ @ 0x4a85a6 _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchI7Nothing15DispatchProcessRKNS8_8CopyableESB_EENS5_6FutureIT_EERKNS5_3PIDIT0_EEMSG_FSE_T1_EOT2_EUlSt10unique_ptrINS5_7PromiseIS7_EESt14default_deleteISR_EEOS9_PNS5_11ProcessBaseEE_JSU_S9_St12_PlaceholderILi1EJSX_EEEvOSD_DpOT0_ @ 
0x4a855d _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI7Nothing15DispatchProcessRKNSB_8CopyableESE_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSJ_FSH_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteISU_EEOSC_S3_E_JSX_SC_St12_PlaceholderILi1EEclEOS3_ @ 0x721f58 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_ @ 0x721e19 process::ProcessBase::consume() @ 0x780169 _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE @ 0x41d7f4 process::ProcessBase::serve() @ 0x71d315 process::ProcessManager::resume() @ 0x7d8d8e process::ProcessManager::init_threads()::$_8::operator()() @ 0x7d8c4d
[jira] [Created] (MESOS-9137) GRPC build fails to pass compiler flags
James Peach created MESOS-9137:

             Summary: GRPC build fails to pass compiler flags
                 Key: MESOS-9137
                 URL: https://issues.apache.org/jira/browse/MESOS-9137
             Project: Mesos
          Issue Type: Bug
          Components: build
            Reporter: James Peach

The GRPC build integration fails to pass compiler flags down from the main build into the GRPC component build. This can make the build fail in surprising ways. For example, if you use {{CXXFLAGS="-fsanitize=thread" CFLAGS="-fsanitize=thread"}}, the build fails because of the inconsistent application of these flags across bundled components. In this build log, libprotobuf was built using the correct flags, which then causes GRPC to fail because it is missing the flags:
{noformat}
make[3]: Entering directory '/home/jpeach/src/asf-mesos/build/3rdparty'
cd grpc-1.10.0 && \
  CPPFLAGS="-I/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src \
    \
    \
    -Wno-array-bounds" \
  make \
    /home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc++.a /home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc.a /home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/libs/opt/libgpr.a \
    CC="/home/jpeach/src/asf-mesos/build/cc" \
    CXX="/home/jpeach/src/asf-mesos/build/c++" \
    LD="/home/jpeach/src/asf-mesos/build/cc" \
    LDXX="/home/jpeach/src/asf-mesos/build/c++" \
    LDFLAGS="-L/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/.libs \
      \
      " \
    HAS_PKG_CONFIG=false \
    NO_PROTOC=false \
    PROTOC="/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/protoc"
make[4]: Entering directory '/home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0'
mkdir -p `dirname /home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/bins/opt/grpc_cpp_plugin`
/home/jpeach/src/asf-mesos/build/c++ -L/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/.libs /home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/objs/opt/src/compiler/cpp_plugin.o /home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc_plugin_support.a -lprotoc -lprotobuf -ldl -lrt -lm -lpthread -lz -lprotoc -lprotobuf -o /home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/bins/opt/grpc_cpp_plugin
/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/.libs/libprotoc.a(code_generator.o): In function `__cxx_global_var_init':
code_generator.cc:(.text.startup+0xd): undefined reference to `__tsan_func_entry'
code_generator.cc:(.text.startup+0x43): undefined reference to `__tsan_func_exit'
code_generator.cc:(.text.startup+0x57): undefined reference to `__tsan_func_exit'
/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/.libs/libprotoc.a(code_generator.o): In function `_GLOBAL__sub_I_code_generator.cc':
code_generator.cc:(.text.startup+0x7d): undefined reference to `__tsan_func_entry'
code_generator.cc:(.text.startup+0x8c): undefined reference to `__tsan_func_exit'
code_generator.cc:(.text.startup+0xa0): undefined reference to `__tsan_func_exit'
/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/.libs/libprotoc.a(code_generator.o): In function `google::protobuf::compiler::CodeGenerator::~CodeGenerator()':
code_generator.cc:(.text._ZN6google8protobuf8compiler13CodeGeneratorD0Ev+0x14): undefined reference to `__tsan_func_entry'
/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/.libs/libprotoc.a(code_generator.o): In function `google::protobuf::compiler::CodeGenerator::GenerateAll(std::vector > const&, std::__cxx11::basic_string, std::allocator > const&, google::protobuf::compiler::GeneratorContext*, std::__cxx11::basic_string, std::allocator >*) const':
{noformat}
[jira] [Commented] (MESOS-9115) Stout depends on missing rapidjson headers.
[ https://issues.apache.org/jira/browse/MESOS-9115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559953#comment-16559953 ] James Peach commented on MESOS-9115: Summoning [~bmahler] > Stout depends on missing rapidjson headers. > --- > > Key: MESOS-9115 > URL: https://issues.apache.org/jira/browse/MESOS-9115 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: James Peach >Priority: Major > > Stout depends on {{}} and {{}}, > and these eventually depend on files in {{}}. When we > install Mesos, we aren't installing the rapidjson internal headers, which > breaks the build for external Mesos modules. > {noformat} > 05:54:07 - In file included from /usr/include/stout/jsonify.hpp:36:0, > 05:54:07 - from /usr/include/stout/json.hpp:41, > 05:54:07 - from /usr/include/mesos/resources.hpp:37, > 05:54:07 - from /usr/include/mesos/slave/isolator.hpp:23, > 05:54:07 - from /usr/include/mesos/module/isolator.hpp:23, > 05:54:07 - from src/isolator.cc:8: > 05:54:07 - /usr/include/rapidjson/stringbuffer.h:19:28: fatal error: > internal/stack.h: No such file or directory > 05:54:07 - #include "internal/stack.h" > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9115) Stout depends on missing rapidjson headers.
James Peach created MESOS-9115: -- Summary: Stout depends on missing rapidjson headers. Key: MESOS-9115 URL: https://issues.apache.org/jira/browse/MESOS-9115 Project: Mesos Issue Type: Bug Components: build Reporter: James Peach Stout depends on {{}} and {{}}, and these eventually depend on files in {{}}. When we install Mesos, we aren't installing the rapidjson internal headers, which breaks the build for external Mesos modules. {noformat} 05:54:07 - In file included from /usr/include/stout/jsonify.hpp:36:0, 05:54:07 - from /usr/include/stout/json.hpp:41, 05:54:07 - from /usr/include/mesos/resources.hpp:37, 05:54:07 - from /usr/include/mesos/slave/isolator.hpp:23, 05:54:07 - from /usr/include/mesos/module/isolator.hpp:23, 05:54:07 - from src/isolator.cc:8: 05:54:07 - /usr/include/rapidjson/stringbuffer.h:19:28: fatal error: internal/stack.h: No such file or directory 05:54:07 - #include "internal/stack.h" {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-9065) Apply the `override` keyword globally.
[ https://issues.apache.org/jira/browse/MESOS-9065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537947#comment-16537947 ] James Peach edited comment on MESOS-9065 at 7/10/18 3:38 AM: - |[https://reviews.apache.org/r/67866/] | Apply the `override` keyword to stout.| |[https://reviews.apache.org/r/67867/] | Apply the `override` keyword to libprocess.| |[https://reviews.apache.org/r/67868/] |Apply the `override` keyword to Mesos. | |[https://reviews.apache.org/r/67869/] |Add use of `override` to the Mesos C++ style guide. | was (Author: jamespeach): |https://reviews.apache.org/r/67866/ | Apply the `override` keyword to stout.| |https://reviews.apache.org/r/67867/ | Apply the `override` keyword to libprocess.| |https://reviews.apache.org/r/67868/ |Apply the `override` keyword to Mesos. | |https://reviews.apache.org/r/67869/ |Add use of `override` to the Mesos C++ style guide. | > Apply the `override` keyword globally. > -- > > Key: MESOS-9065 > URL: https://issues.apache.org/jira/browse/MESOS-9065 > Project: Mesos > Issue Type: Bug >Reporter: James Peach >Assignee: James Peach >Priority: Major > > As per [this > thread|https://lists.apache.org/thread.html/371c23ca743dbc354fcf440d1fa9e99c29f20602c5efd7dc563713a9@%3Cdev.mesos.apache.org%3E], > apply the {{override}} keyword globally. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9065) Apply the `override` keyword globally.
James Peach created MESOS-9065: -- Summary: Apply the `override` keyword globally. Key: MESOS-9065 URL: https://issues.apache.org/jira/browse/MESOS-9065 Project: Mesos Issue Type: Bug Reporter: James Peach Assignee: James Peach As per [this thread|https://lists.apache.org/thread.html/371c23ca743dbc354fcf440d1fa9e99c29f20602c5efd7dc563713a9@%3Cdev.mesos.apache.org%3E], apply the {{override}} keyword globally. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
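As an illustration of what applying {{override}} buys, here is a minimal C++ sketch. The class names `Isolator` and `CgroupsIsolator` and the helper `describe` are hypothetical and are not taken from the Mesos sources; the point is only that the keyword turns a silently mismatched signature into a compile error.

```cpp
#include <cassert>
#include <string>

// Hypothetical base class, for illustration only; not the Mesos isolator API.
struct Isolator
{
  virtual ~Isolator() = default;
  virtual std::string name() const { return "base"; }
};

struct CgroupsIsolator : Isolator
{
  // `override` asks the compiler to verify that this declaration really
  // overrides a virtual member function of the base class.
  std::string name() const override { return "cgroups"; }

  // Without `const` the signature no longer matches the base class. Left
  // unmarked, this would quietly declare a new, unrelated function; marked
  // `override`, the compiler rejects it, so the line stays commented out:
  //
  //   std::string name() override { return "oops"; }
};

// Calls through the base-class interface, so dynamic dispatch is exercised.
std::string describe(const Isolator& isolator)
{
  return isolator.name();
}
```

Calling `describe(CgroupsIsolator{})` dispatches to the derived implementation. The commented-out variant shows the class of mistake that a global `override` pass is meant to surface at compile time.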
[jira] [Created] (MESOS-9057) Add a cmake option to disable -Werror.
James Peach created MESOS-9057: -- Summary: Add a cmake option to disable -Werror. Key: MESOS-9057 URL: https://issues.apache.org/jira/browse/MESOS-9057 Project: Mesos Issue Type: Bug Components: build, cmake Reporter: James Peach The autotools build has a {{\-\-disable-werror}} build option that disables the {{-Werror}} compile flag in Mesos and its dependencies. We need to do the same for cmake so that this doesn't block upgrading compilers or other dependencies. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9051) Move agent call validation into common validation library.
[ https://issues.apache.org/jira/browse/MESOS-9051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-9051: -- Assignee: James Peach | [https://reviews.apache.org/r/67830/] | Moved `executor::Call` validation to common validation library. | > Move agent call validation into common validation library. > -- > > Key: MESOS-9051 > URL: https://issues.apache.org/jira/browse/MESOS-9051 > Project: Mesos > Issue Type: Bug > Components: agent, build >Reporter: James Peach >Assignee: James Peach >Priority: Minor > > The executor driver calls {{executor::call::validate()}} from > {{src/slave/validation.cpp}}, which creates an upward dependency from > libmesos.so (where the executor driver has to live) to the agent. If we can > move the validation calls down to the common validation library, we can break > this dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9040) Break scheduler driver dependency on mesos-local.
[ https://issues.apache.org/jira/browse/MESOS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533262#comment-16533262 ] James Peach commented on MESOS-9040: [~benjaminhindman] [~tillt] What do you think about just removing this feature? It's not documented and I don't know of anyone who uses it (no-one on the dev list responded, though we should try harder to let people know if we are going to remove it). With the advent of the HTTP API, maybe there are fewer users of the scheduler drivers, so this is less likely to benefit framework developers. I can also take an action to add some docs about integration testing with {{mesos-local}}. > Break scheduler driver dependency on mesos-local. > - > > Key: MESOS-9040 > URL: https://issues.apache.org/jira/browse/MESOS-9040 > Project: Mesos > Issue Type: Task > Components: build, scheduler driver >Reporter: James Peach >Priority: Minor > > The scheduler driver in {{src/sched/sched.cpp}} has some special dependencies > on the {{mesos-local}} code. This seems fairly hacky, but it also causes > binary dependencies on {{src/local/local.cpp}} to be dragged into > {{libmesos.so}}. {{libmesos.so}} would not otherwise require this code, which > could be isolated in the {{mesos-local}} command. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9041) Break agent dependencies out of libmesos.
[ https://issues.apache.org/jira/browse/MESOS-9041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532713#comment-16532713 ] James Peach commented on MESOS-9041: I got a rough prototype working and it does improve build times a little. I tested on a local VM (4 CPUs, 8G RAM), with a fully populated cache and {{make -j4}}. Unmodified build: {noformat} real 9m23.702s user 8m13.996s sys 3m32.028s {noformat} Agent dependencies broken into libmesos-agent.so: {noformat} real 8m4.517s user 7m23.865s sys 3m47.629s {noformat} So this looks like a nice improvement in at least one configuration. > Break agent dependencies out of libmesos. > - > > Key: MESOS-9041 > URL: https://issues.apache.org/jira/browse/MESOS-9041 > Project: Mesos > Issue Type: Task > Components: agent, build >Reporter: James Peach >Priority: Major > > {{libmesos.so}} includes all the dependencies for both the master and the > agent. This means that it has way more symbols than necessary (causing > inflated build times), and drags in dependencies (e.g. libnl.so, libblkid.so) > that are only necessary on the agent. We should attempt to separate the agent > code out of {{libmesos.so}}, which would improve the build cleanliness and > hopefully performance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
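To put the two wall-clock ("real") timings above in proportion, the arithmetic can be sketched in C++ as follows. The helper names `toSeconds` and `relativeSaving` are made up for this illustration. The figures come from a single run on one VM, so the result is indicative rather than a benchmark.

```cpp
#include <cassert>
#include <cmath>

// Converts a `time`-style reading of the form XmY.YYYs, passed as separate
// minutes and seconds, into seconds.
constexpr double toSeconds(int minutes, double seconds)
{
  return minutes * 60.0 + seconds;
}

// Fraction of the baseline wall-clock time saved by the modified build.
constexpr double relativeSaving(double baseline, double modified)
{
  return (baseline - modified) / baseline;
}
```

With the reported values, `toSeconds(9, 23.702)` is 563.702 s and `toSeconds(8, 4.517)` is 484.517 s, a difference of 79.185 s, so `relativeSaving` comes out at roughly 0.14, that is, about 14% less wall-clock time for this configuration.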
[jira] [Created] (MESOS-9051) Move agent call validation into common validation library.
James Peach created MESOS-9051: -- Summary: Move agent call validation into common validation library. Key: MESOS-9051 URL: https://issues.apache.org/jira/browse/MESOS-9051 Project: Mesos Issue Type: Bug Components: agent, build Reporter: James Peach The executor driver calls {{executor::call::validate()}} from {{src/slave/validation.cpp}}, which creates an upward dependency from libmesos.so (where the executor driver has to live) to the agent. If we can move the validation calls down to the common validation library, we can break this dependency. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
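The layering idea in this ticket can be sketched in C++ as below. Everything here is an assumption made for illustration: the `Call` struct is a hypothetical stand-in for the protobuf `executor::Call` message, the `common::validation` namespace and the error strings are invented, and `std::optional` (C++17) is used in place of whatever result type the real code uses. The sketch only shows that a validator placed in a common library needs nothing from the agent.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for the protobuf `executor::Call` message; the real
// message has more fields and generated accessors.
struct Call
{
  std::string type;
  std::string executorId;
  std::string frameworkId;
};

namespace common {
namespace validation {

// Returns an error description, or std::nullopt when the call is well formed.
// The function depends only on the message type, so both the executor driver
// and the agent could use it without one linking against the other.
std::optional<std::string> validate(const Call& call)
{
  if (call.type.empty()) {
    return std::string("Expecting 'type' to be present");
  }

  if (call.executorId.empty()) {
    return std::string("Expecting 'executor_id' to be present");
  }

  if (call.frameworkId.empty()) {
    return std::string("Expecting 'framework_id' to be present");
  }

  return std::nullopt;
}

} // namespace validation
} // namespace common
```

A caller would check `common::validation::validate(call)` and reject the call if a value is returned. Because the validator lives below both consumers in the dependency graph, the upward dependency described above disappears in this arrangement.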
[jira] [Comment Edited] (MESOS-9040) Break scheduler driver dependency on mesos-local.
[ https://issues.apache.org/jira/browse/MESOS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530919#comment-16530919 ] James Peach edited comment on MESOS-9040 at 7/3/18 7:25 AM: {quote} It is a convenience thing meant for framework developers - maybe we can achieve the same by exec'ing the mesos-local runnable if desired. {quote} Hmm, I never knew that. Our framework developers certainly don't know about it either. Do you know of anyone who does use it? Is there anything I can run to experiment with it? If framework developers wanted to use {{mesos-local}}, why wouldn't they just exec the `mesos-local` process in their CI? was (Author: jamespeach): {quote} It is a convenience thing meant for framework developers - maybe we can achieve the same by exec'ing the mesos-local runnable if desired. {quote} Hmm, I never knew that. Our framework developers certainly don't know about it either. Do you know of anyone who does use it? Is there anything I can run to experiment with it? If framework developers wanted to use {{mesos-local}}, why wouldn't they just exec the `mesos-local` process i their CI? > Break scheduler driver dependency on mesos-local. > - > > Key: MESOS-9040 > URL: https://issues.apache.org/jira/browse/MESOS-9040 > Project: Mesos > Issue Type: Task > Components: build, scheduler driver >Reporter: James Peach >Priority: Minor > > The scheduler driver in {{src/sched/sched.cpp}} has some special dependencies > on the {{mesos-local}} code. This seems fairly hacky, but it also causes > binary dependencies on {{src/local/local.cpp}} to be dragged into > {{libmesos.so}}. {{libmesos.so}} would not otherwise require this code, which > could be isolated in the {{mesos-local}} command. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9040) Break scheduler driver dependency on mesos-local.
[ https://issues.apache.org/jira/browse/MESOS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530919#comment-16530919 ] James Peach commented on MESOS-9040: {quote} It is a convenience thing meant for framework developers - maybe we can achieve the same by exec'ing the mesos-local runnable if desired. {quote} Hmm, I never knew that. Our framework developers certainly don't know about it either. Do you know of anyone who does use it? Is there anything I can run to experiment with it? If framework developers wanted to use {{mesos-local}}, why wouldn't they just exec the `mesos-local` process in their CI? > Break scheduler driver dependency on mesos-local. > - > > Key: MESOS-9040 > URL: https://issues.apache.org/jira/browse/MESOS-9040 > Project: Mesos > Issue Type: Task > Components: build, scheduler driver >Reporter: James Peach >Priority: Minor > > The scheduler driver in {{src/sched/sched.cpp}} has some special dependencies > on the {{mesos-local}} code. This seems fairly hacky, but it also causes > binary dependencies on {{src/local/local.cpp}} to be dragged into > {{libmesos.so}}. {{libmesos.so}} would not otherwise require this code, which > could be isolated in the {{mesos-local}} command. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-9043) Move check validators to the common validation library.
[ https://issues.apache.org/jira/browse/MESOS-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529039#comment-16529039 ] James Peach edited comment on MESOS-9043 at 7/1/18 10:07 AM: - |[r/67794|https://reviews.apache.org/r/67794/]|Moved `validation::healthCheck` to common code.| |[r/67795|https://reviews.apache.org/r/67795/]|Moved `CheckInfo` validation to common code.| was (Author: jamespeach): |[r/67794|https://reviews.apache.org/r/67794/]|Moved `validation::healthCheck` to common code.| |[/r/67795|https://reviews.apache.org/r/67795/]|Moved `CheckInfo` validation to common code.| > Move check validators to the common validation library. > --- > > Key: MESOS-9043 > URL: https://issues.apache.org/jira/browse/MESOS-9043 > Project: Mesos > Issue Type: Task > Components: build >Reporter: James Peach >Assignee: James Peach >Priority: Major > > The {{src/checks}} library contains some protobuf validation APIs that are > also used by the master. This creates a build dependency where the master > depends on the checks library but doesn't actually use the checks. We can > break this dependency by pushing the validators down into > {{src/common/validation.cpp}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9043) Move check validators to the common validation library.
[ https://issues.apache.org/jira/browse/MESOS-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-9043: -- Assignee: James Peach > Move check validators to the common validation library. > --- > > Key: MESOS-9043 > URL: https://issues.apache.org/jira/browse/MESOS-9043 > Project: Mesos > Issue Type: Task > Components: build >Reporter: James Peach >Assignee: James Peach >Priority: Major > > The {{src/checks}} library contains some protobuf validation APIs that are > also used by the master. This creates a build dependency where the master > depends on the checks library but doesn't actually use the checks. We can > break this dependency by pushing the validators down into > {{src/common/validation.cpp}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9043) Move check validators to the common validation library.
James Peach created MESOS-9043: -- Summary: Move check validators to the common validation library. Key: MESOS-9043 URL: https://issues.apache.org/jira/browse/MESOS-9043 Project: Mesos Issue Type: Task Components: build Reporter: James Peach The {{src/checks}} library contains some protobuf validation APIs that are also used by the master. This creates a build dependency where the master depends on the checks library but doesn't actually use the checks. We can break this dependency by pushing the validators down into {{src/common/validation.cpp}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)