[jira] [Commented] (MESOS-9963) URI stringification constructs malformed URIs.

2019-09-18 Thread James Peach (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933051#comment-16933051
 ] 

James Peach commented on MESOS-9963:


Verified that this issue doesn't cause any problems in the current code, 
because callers are careful to ensure that the path component begins with '/'.
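
To illustrate the failure mode, here is a minimal Python sketch. It is not the Mesos C++ code: the function names `stringify_naive` and `stringify_checked` are hypothetical, and the sketch assumes the stringifier simply concatenates the components, which is what the malformed URL in the report suggests.

```python
from typing import Optional


def stringify_naive(scheme: str, host: str, port: Optional[int], path: str,
                    query: str = "", fragment: str = "") -> str:
    """Concatenate URI components without validating the path."""
    out = f"{scheme}://{host}"
    if port is not None:
        out += f":{port}"
    out += path  # Malformed if `path` does not start with '/'.
    if query:
        out += f"?{query}"
    if fragment:
        out += f"#{fragment}"
    return out


def stringify_checked(scheme: str, host: str, port: Optional[int], path: str,
                      query: str = "", fragment: str = "") -> str:
    """Same as above, but insert the '/' separator when the caller omitted it."""
    if path and not path.startswith("/"):
        path = "/" + path
    return stringify_naive(scheme, host, port, path, query, fragment)


# Reproduces the shape of the malformed URL from the issue description.
print(stringify_naive("docker-manifest", "docker-cache.example.com", 443,
                      "org/image-name", "latest", "https"))
```

Whether the stringifier should normalize the path like this or reject it outright is a design choice that this sketch does not settle.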

> URI stringification constructs malformed URIs.
> --
>
> Key: MESOS-9963
> URL: https://issues.apache.org/jira/browse/MESOS-9963
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>  Labels: containerization
>
> Setting {{docker_registry="https://docker-cache.example.com/"}} and then 
> pulling an image named {{org/image-name:latest}} fails. The Docker image 
> puller ends up constructing a malformed URL for the manifest:
> {noformat}
> Pulling image 'org/siri-centos6:stage' from 
> 'docker-manifest://docker-cache.example.com:443org/image-name?latest#https' 
> to '/tmp/mesos/store/docker/staging/LGArHA'
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-9963) Docker puller malforms registry URLs.

2019-09-05 Thread James Peach (Jira)
James Peach created MESOS-9963:
--

 Summary: Docker puller malforms registry URLs.
 Key: MESOS-9963
 URL: https://issues.apache.org/jira/browse/MESOS-9963
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach
Assignee: James Peach


Setting {{docker_registry="https://docker-cache.example.com/"}} and then pulling 
an image named {{org/image-name:latest}} fails. The Docker image puller ends up 
constructing a malformed URL for the manifest:

{noformat}
Pulling image 'org/siri-centos6:stage' from 
'docker-manifest://docker-cache.example.com:443org/image-name?latest#https' to 
'/tmp/mesos/store/docker/staging/LGArHA'
{noformat}





[jira] [Commented] (MESOS-4741) Add role information for static reservation in /master/roles

2019-08-28 Thread James Peach (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918205#comment-16918205
 ] 

James Peach commented on MESOS-4741:


MESOS-9888 is a duplicate.

> Add role information for static reservation in /master/roles
> 
>
> Key: MESOS-4741
> URL: https://issues.apache.org/jira/browse/MESOS-4741
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Klaus Ma
>Priority: Major
>
> {{/master/roles}} should show static reservation roles even when there are no 
> tasks.
> {code}
> Klauss-MacBook-Pro:mesos klaus$ curl http://localhost:5050/master/roles.json | python -m json.tool
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100    93  100    93    0     0  13907      0 --:--:-- --:--:-- --:--:-- 15500
> {
>     "roles": [
>         {
>             "frameworks": [],
>             "name": "*",
>             "resources": {
>                 "cpus": 0,
>                 "disk": 0,
>                 "mem": 0
>             },
>             "weight": 1.0
>         }
>     ]
> }
> {code}
> After submitting tasks to r1, the role is shown.
> {code}
> Klauss-MacBook-Pro:mesos klaus$ curl http://localhost:5050/master/roles | python -m json.tool
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100   221  100   221    0     0  32721      0 --:--:-- --:--:-- --:--:-- 36833
> {
>     "roles": [
>         {
>             "frameworks": [],
>             "name": "*",
>             "resources": {
>                 "cpus": 0,
>                 "disk": 0,
>                 "mem": 0
>             },
>             "weight": 1.0
>         },
>         {
>             "frameworks": [
>                 "b4f15a2e-5d9a-4d31-a29e-7737af41c8e4-0002"
>             ],
>             "name": "r1",
>             "resources": {
>                 "cpus": 1.0,
>                 "disk": 0,
>                 "mem": 0
>             },
>             "weight": 1.0
>         }
>     ]
> }
> {code}





[jira] [Commented] (MESOS-9935) The agent crashes after the disk du isolator supporting rootfs checks.

2019-08-12 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905794#comment-16905794
 ] 

James Peach commented on MESOS-9935:


This reproduces if you run a task without any disk resource.
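
The `_Map_base::at` exception in the quoted log is consistent with an unconditional map lookup on a key that was never inserted. The following Python sketch is only an analogy for that pattern, not the isolator code: `build_paths`, `quota_unchecked`, and `quota_checked` are hypothetical names, and a `KeyError` stands in for the C++ `std::out_of_range`.

```python
from typing import Dict, Optional


def build_paths(sandbox: str, disk_quota: Optional[int]) -> Dict[str, int]:
    """Record a path entry only when the task has a disk quota."""
    paths: Dict[str, int] = {}
    if disk_quota is not None:
        paths[sandbox] = disk_quota
    return paths


def quota_unchecked(paths: Dict[str, int], sandbox: str) -> int:
    # Analogous to calling at() on a map: raises when the key is missing.
    return paths[sandbox]


def quota_checked(paths: Dict[str, int], sandbox: str) -> Optional[int]:
    # Tolerates a task that was launched without any disk resource.
    return paths.get(sandbox)
```

With `build_paths("/sandbox", None)`, the unchecked lookup raises while the checked lookup returns `None`.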

> The agent crashes after the disk du isolator supporting rootfs checks.
> --
>
> Key: MESOS-9935
> URL: https://issues.apache.org/jira/browse/MESOS-9935
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: James Peach
>Priority: Blocker
>
> This issue was introduced by this patch:
> https://github.com/apache/mesos/commit/8ba0682521c6051b42f33b3dd96a37f4d46a290d#diff-33089e53bdf9f646cdb9317c212eda02
> A task can be launched without a disk resource. However, after this patch, if 
> the disk resource does not exist, the agent crashes, because info->paths 
> only gets an entry for 'path' when there is a quota and the quota comes from the 
> disk resource.
> {noformat}
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: F0809 14:54:00.017730 15498 process.cpp:3057] Aborting 
> libprocess: 'posix-disk-isolator(1)@172.12.2.196:5051' threw exception: 
> _Map_base::at
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: *** Check failure stack trace: ***
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f7d585cd  google::LogMessage::Fail()
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f7d5a828  google::LogMessage::SendToLog()
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f7d58163  google::LogMessage::Flush()
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f7d5b169  
> google::LogMessageFatal::~LogMessageFatal()
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f7cb8dbd  process::ProcessManager::resume()
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f7cbe926  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f3976070  (unknown)
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f3194e25  start_thread
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f2ebebad  __clone
> {noformat}





[jira] [Assigned] (MESOS-9935) The agent crashes after the disk du isolator supporting rootfs checks.

2019-08-12 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9935:
--

Assignee: James Peach

> The agent crashes after the disk du isolator supporting rootfs checks.
> --
>
> Key: MESOS-9935
> URL: https://issues.apache.org/jira/browse/MESOS-9935
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: James Peach
>Priority: Blocker
>
> This issue was introduced by this patch:
> https://github.com/apache/mesos/commit/8ba0682521c6051b42f33b3dd96a37f4d46a290d#diff-33089e53bdf9f646cdb9317c212eda02
> A task can be launched without a disk resource. However, after this patch, if 
> the disk resource does not exist, the agent crashes, because info->paths 
> only gets an entry for 'path' when there is a quota and the quota comes from the 
> disk resource.
> {noformat}
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: F0809 14:54:00.017730 15498 process.cpp:3057] Aborting 
> libprocess: 'posix-disk-isolator(1)@172.12.2.196:5051' threw exception: 
> _Map_base::at
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: *** Check failure stack trace: ***
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f7d585cd  google::LogMessage::Fail()
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f7d5a828  google::LogMessage::SendToLog()
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f7d58163  google::LogMessage::Flush()
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f7d5b169  
> google::LogMessageFatal::~LogMessageFatal()
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f7cb8dbd  process::ProcessManager::resume()
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f7cbe926  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f3976070  (unknown)
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f3194e25  start_thread
> Aug 09 14:54:00 ip-172-12-2-196.us-west-2.compute.internal 
> mesos-agent[15492]: @ 0x7f65f2ebebad  __clone
> {noformat}





[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail

2019-08-07 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902552#comment-16902552
 ] 

James Peach commented on MESOS-9875:


{noformat}
f9330006-d885-4ef0-b2c7-c9c6fcc239e5 is the persistence ID.
5fa5c810-2dd3-41cb-9633-a3ef404b08c4 is the operation UUID.
honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14 is the operation ID.

I0627 22:03:17.360236 3529210 slave.cpp:4282] Updated checkpointed operations 
from [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for framework 
efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: 
honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: 
OPERATION_FINISHED) ] to [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for 
framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: 
honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: 
OPERATION_FINISHED), 5fa5c810-2dd3-41cb-9633-a3ef404b08c4 (CREATE for framework 
efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: 
honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14, latest state: 
OPERATION_PENDING) ]

I0627 22:03:17.360723 3529210 slave.cpp:8670] Updating the state of operation 
'honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14' (uuid: 
5fa5c810-2dd3-41cb-9633-a3ef404b08c4) for framework 
efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525 (latest state: OPERATION_FINISHED, 
status update state: OPERATION_FINISHED)

E0627 22:03:17.365811 3529210 slave.cpp:4257] EXIT with status 1: Failed to 
sync checkpointed resources: Failed to create the persistent volume 
f9330006-d885-4ef0-b2c7-c9c6fcc239e5 at 
'/srv/mesos/work/volumes/roles/test-3/f9330006-d885-4ef0-b2c7-c9c6fcc239e5': 
Operation not permitted

{noformat}

> Mesos did not respond correctly when operations should fail
> ---
>
> Key: MESOS-9875
> URL: https://issues.apache.org/jira/browse/MESOS-9875
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Yifan Xing
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere
> Attachments: Screen Shot 2019-06-27 at 15.07.20.png
>
>
> For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedback, we 
> SSHed into the Mesos agent and made it unable to create subdirectories in 
> {{/srv/mesos/work/volumes}}. However, Mesos did not send any operation-failed 
> response. Instead, we received {{OPERATION_FINISHED}} feedback.
> Steps to recreate the issue:
> 1. SSH into a magent.
>  2. Make it impossible to create a persistent volume (we expect the agent to 
> crash and reregister, and the master to realize that the operation is 
> {{OPERATION_DROPPED}}):
>  * cd /srv/mesos/work (if it doesn't exist mkdir /srv/mesos/work/volumes)
>  * chattr -RV +i volumes (then no subdirectories can be created)
> 3. Launch a service with persistent volumes with the constraint of only using 
> the magent modified above.
>  
>  
> Logs for the scheduler for receiving `OPERATION_FINISHED`:
> (Also see screenshot)
>  
> 2019-06-27 21:57:11.879 [12768651|rdar://12768651] 
> [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored 
> operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and 
> feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on 
> serviceID=yifan-badagents-1
>  
> * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: 
> REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch 
> container: Failed to change the ownership of the persistent volume at 
> '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' 
> with uid 264 and gid 264: No such file or directory





[jira] [Commented] (MESOS-7580) Use root fs as lower RO layer and container fs as upper layer

2019-07-30 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895849#comment-16895849
 ] 

James Peach commented on MESOS-7580:


I think that MESOS-9900 is related to this request. In MESOS-9900, any changes 
to the overlayfs upperdir will be charged to the container disk quota.

> Use root fs as lower RO layer and container fs as upper layer
> -
>
> Key: MESOS-7580
> URL: https://issues.apache.org/jira/browse/MESOS-7580
> Project: Mesos
>  Issue Type: Wish
>  Components: containerization
>Reporter: Mikhail Lesyk
>Priority: Major
>
> See example:
> {code}
> mkdir -p rootfs/{opt,container,workdir,result}
> mount -t overlay -o 
> lowerdir=rootfs,upperdir=rootfs/container,workdir=rootfs/workdir none 
> rootfs/result
> touch rootfs/result/opt/trash
> umount rootfs/result
> ls -a rootfs/opt/
> .  ..
> {code}
> Where rootfs - imaginary root filesystem
> rootfs/opt - variable directory on that filesystem
> rootfs/container - container work dir
> rootfs/result - result overlayfs mountpoint(root fs from container point of 
> view)
> So any change under rootfs/result will not be visible from the rootfs point of 
> view, and rootfs will remain clean. Every container could have its own snapshot 
> of the host's root filesystem, with its changes kept separate.





[jira] [Assigned] (MESOS-9900) Include overlayfs upperdir in disk quota accounting.

2019-07-22 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9900:
--

Assignee: James Peach

> Include overlayfs upperdir in disk quota accounting.
> 
>
> Key: MESOS-9900
> URL: https://issues.apache.org/jira/browse/MESOS-9900
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, storage
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> Currently, the overlayfs upperdir is not included in any disk quota 
> accounting. This means that a task can write arbitrary amounts of data to 
> /tmp and will escape the sandbox disk quota.
> Propose that we propagate the overlayfs upperdir directory to the disk 
> isolators so that they can manage this storage, and include it in the total 
> sandbox usage quota. This would need to be supported by both {{disk/du}} and 
> {{disk/xfs}} isolators. We should be able to propagate the additional 
> information out of the provisioner in {{ProvisionInfo}} and then into 
> {{ContainerConfig}}.
> The proposed semantics would be that both the sandbox and overlayfs upperdir 
> usage would count towards the ephemeral disk quota.





[jira] [Created] (MESOS-9900) Include overlayfs upperdir in disk quota accounting.

2019-07-22 Thread James Peach (JIRA)
James Peach created MESOS-9900:
--

 Summary: Include overlayfs upperdir in disk quota accounting.
 Key: MESOS-9900
 URL: https://issues.apache.org/jira/browse/MESOS-9900
 Project: Mesos
  Issue Type: Improvement
  Components: containerization, storage
Reporter: James Peach


Currently, the overlayfs upperdir is not included in any disk quota accounting. 
This means that a task can write arbitrary amounts of data to /tmp and will 
escape the sandbox disk quota.

Propose that we propagate the overlayfs upperdir directory to the disk 
isolators so that they can manage this storage, and include it in the total 
sandbox usage quota. This would need to be supported by both {{disk/du}} and 
{{disk/xfs}} isolators. We should be able to propagate the additional 
information out of the provisioner in {{ProvisionInfo}} and then into 
{{ContainerConfig}}.

The proposed semantics would be that both the sandbox and overlayfs upperdir 
usage would count towards the ephemeral disk quota.
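
As a rough illustration of the proposed semantics, the Python sketch below charges both directories to a single quota. It is a simplified stand-in under stated assumptions: `directory_usage` and `over_quota` are hypothetical helpers, and summing apparent file sizes is not how the {{disk/du}} or {{disk/xfs}} isolators actually measure usage.

```python
import os


def directory_usage(root: str) -> int:
    """Sum the apparent sizes of files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            total += os.path.getsize(path)
    return total


def over_quota(sandbox: str, upperdir: str, quota_bytes: int) -> bool:
    """Charge both the sandbox and the overlayfs upperdir to one quota."""
    return directory_usage(sandbox) + directory_usage(upperdir) > quota_bytes
```

Under this model, data written to /tmp lands in the upperdir and counts against the same limit as the sandbox.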





[jira] [Commented] (MESOS-9898) Add framework control over the no-new-privileges flag.

2019-07-18 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888461#comment-16888461
 ] 

James Peach commented on MESOS-9898:


/cc [~jjanco]

> Add framework control over the no-new-privileges flag.
> --
>
> Key: MESOS-9898
> URL: https://issues.apache.org/jira/browse/MESOS-9898
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, HTTP API
>Reporter: James Peach
>Priority: Major
>
> Following on from MESOS-9770, we can add framework control over the 
> no-new-privileges flag. 
> The implementation is to add a `no_new_privileges` boolean to the 
> {{SeccompInfo}} message that will allow a framework to toggle it on and off. 
> This means that the seccomp isolator must be ordered after the nnp isolator 
> so that it has priority (last writer wins in a protobuf merge). The nnp 
> isolator will still unconditionally set the flag.
> Design doc: 
> https://docs.google.com/document/d/1x9S94-P0-nsXHGrwY4BHZ_NEC_bTFMIsDkxxaTd5Vok/edit?usp=sharing





[jira] [Created] (MESOS-9898) Add framework control over the no-new-privileges flag.

2019-07-18 Thread James Peach (JIRA)
James Peach created MESOS-9898:
--

 Summary: Add framework control over the no-new-privileges flag.
 Key: MESOS-9898
 URL: https://issues.apache.org/jira/browse/MESOS-9898
 Project: Mesos
  Issue Type: Improvement
  Components: containerization, HTTP API
Reporter: James Peach


Following on from MESOS-9770, we can add framework control over the 
no-new-privileges flag. 

The implementation is to add a `no_new_privileges` boolean to the 
{{SeccompInfo}} message that will allow a framework to toggle it on and off. 
This means that the seccomp isolator must be ordered after the nnp isolator so 
that it has priority (last writer wins in a protobuf merge). The nnp isolator 
will still unconditionally set the flag.
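
The ordering requirement can be pictured with a small Python sketch. It models the merge with plain dictionaries rather than protobuf messages, and `merge_launch_info` plus the two isolator values are hypothetical stand-ins; the only property borrowed from protobuf is that a scalar field set in a later message overwrites the earlier value.

```python
from typing import Dict, List


def merge_launch_info(updates: List[Dict[str, bool]]) -> Dict[str, bool]:
    """Merge per-isolator settings in order; later values overwrite earlier ones."""
    merged: Dict[str, bool] = {}
    for update in updates:
        merged.update(update)
    return merged


nnp_isolator = {"no_new_privileges": True}       # Always sets the flag.
seccomp_isolator = {"no_new_privileges": False}  # Framework opted out.

# Seccomp runs after nnp, so the framework's choice takes priority.
result = merge_launch_info([nnp_isolator, seccomp_isolator])
print(result)  # {'no_new_privileges': False}
```

Reversing the list order would let the nnp isolator's unconditional value win, which is why the seccomp isolator has to come last.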

Design doc: 
https://docs.google.com/document/d/1x9S94-P0-nsXHGrwY4BHZ_NEC_bTFMIsDkxxaTd5Vok/edit?usp=sharing





[jira] [Commented] (MESOS-9770) Add no-new-privileges isolator.

2019-07-18 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888455#comment-16888455
 ] 

James Peach commented on MESOS-9770:


| https://reviews.apache.org/r/71106/ |
| https://reviews.apache.org/r/70757/ |
| https://reviews.apache.org/r/71107/ |

> Add no-new-privileges isolator.
> ---
>
> Key: MESOS-9770
> URL: https://issues.apache.org/jira/browse/MESOS-9770
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: Jacob Janco
>Priority: Major
>
> To give security-minded operators more defense in depth, add a {{linux/nnp}} 
> isolator that sets the no-new-privileges bit before starting the executor.





[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail

2019-07-02 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876736#comment-16876736
 ] 

James Peach commented on MESOS-9875:


{{f9330006-d885-4ef0-b2c7-c9c6fcc239e5}} is the persistence ID.
{{5fa5c810-2dd3-41cb-9633-a3ef404b08c4}} is the operation UUID.
{{honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14}} is the operation ID.

{noformat}

I0627 22:03:17.360236 3529210 slave.cpp:4282] Updated checkpointed operations 
from [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for framework 
efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: 
honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: 
OPERATION_FINISHED) ] to [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for 
framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: 
honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: 
OPERATION_FINISHED), 5fa5c810-2dd3-41cb-9633-a3ef404b08c4 (CREATE for framework 
efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: 
honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14, latest state: 
OPERATION_PENDING) ]
...
I0627 22:03:17.360723 3529210 slave.cpp:8670] Updating the state of operation 
'honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14' (uuid: 
5fa5c810-2dd3-41cb-9633-a3ef404b08c4) for framework 
efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525 (latest state: OPERATION_FINISHED, 
status update state: OPERATION_FINISHED)
...
E0627 22:03:17.365811 3529210 slave.cpp:4257] EXIT with status 1: Failed to 
sync checkpointed resources: Failed to create the persistent volume 
f9330006-d885-4ef0-b2c7-c9c6fcc239e5 at 
'/srv/mesos/work/volumes/roles/test-3/f9330006-d885-4ef0-b2c7-c9c6fcc239e5': 
Operation not permitted
{noformat}


The relevant code sequence is in Slave::applyOperation, and looks roughly like 
this:

{noformat}
track the new operation

checkpointResourceState() (1)

apply the operation (2)
report that the operation was applied

checkpointResourceState() (3)
{noformat}

The operation is checkpointed as pending in (1), but no resource changes are 
made yet. In (2), the operation is applied by making changes to the agent 
resources. At (3), the discrepancy in the checkpointed resources is discovered, 
and the agent tries to create the persistent volume and fails.
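
The ordering can be modeled with a toy Python sketch. This is not the Slave::applyOperation code: the `Agent` class, its fields, and the event strings are hypothetical, and the only thing it tries to capture is that the success report is emitted before the checkpoint that actually touches the disk.

```python
from typing import List


class Agent:
    """Toy model of the checkpoint/apply/report ordering (not the Mesos code)."""

    def __init__(self, volume_creatable: bool) -> None:
        self.volume_creatable = volume_creatable
        self.pending_volumes: List[str] = []
        self.events: List[str] = []

    def checkpoint_resource_state(self) -> None:
        # Syncing checkpointed resources creates any volumes that are missing.
        for volume in self.pending_volumes:
            if not self.volume_creatable:
                self.events.append("exit: failed to create " + volume)
                raise RuntimeError("agent exits with status 1")
        self.pending_volumes.clear()
        self.events.append("checkpointed")

    def apply_operation(self, volume: str) -> None:
        self.events.append("tracked as pending")
        self.checkpoint_resource_state()             # (1) nothing to create yet
        self.pending_volumes.append(volume)          # (2) resources updated
        self.events.append("reported OPERATION_FINISHED")
        self.checkpoint_resource_state()             # (3) creation fails here
```

Running `apply_operation` on an `Agent(volume_creatable=False)` records the OPERATION_FINISHED report before the failure at (3), which matches the sequence in the log.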


> Mesos did not respond correctly when operations should fail
> ---
>
> Key: MESOS-9875
> URL: https://issues.apache.org/jira/browse/MESOS-9875
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yifan Xing
>Priority: Major
>
> For testing persistent volumes with `OPERATION_FAILED/ERROR` feedback, we 
> SSHed into the Mesos agent and made it unable to create subdirectories in 
> /srv/mesos/work/volumes. However, Mesos did not send any operation-failed 
> response. Instead, we received `OPERATION_FINISHED` feedback.
> Steps to recreate the issue:
> 1. SSH into a magent.
> 2. Make it impossible to create a persistent volume (we expect the agent to 
> crash and reregister, and the master to realize that the operation is 
> `OPERATION_DROPPED`):
> * cd /srv/mesos/work (if it doesn't exist mkdir /srv/mesos/work/volumes)
>  * chattr -RV +i volumes (then no subdirectories can be created)
> 3. Launch a service with persistent volumes with the constraint of only using 
> the magent modified above.





[jira] [Commented] (MESOS-9800) libarchive cannot extract tarfile due to UTF-8 encoding issues

2019-06-19 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868084#comment-16868084
 ] 

James Peach commented on MESOS-9800:


Sorry it took so long to get back to you [~falfaro]. We are carrying a revert 
of 2198b961d24b788564d36490cf52f78d7ec07655.

> libarchive cannot extract tarfile due to UTF-8 encoding issues
> --
>
> Key: MESOS-9800
> URL: https://issues.apache.org/jira/browse/MESOS-9800
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7.2
> Environment: Mesos 1.7.2 and Marathon 1.4.3 running on top of Ubuntu 
> 16.04.
>Reporter: Felipe Alfaro Solana
>Priority: Major
> Attachments: certificates2.tar.gz
>
>
> Starting with Mesos 1.7, the following change has been introduced:
>  * [MESOS-8064] - Mesos now requires libarchive to programmatically decode 
> .zip, .tar, .gzip, and other common file compression schemes. Version 3.3.2 
> is bundled in Mesos.
> However, this version of libarchive, which is used by the fetcher component in 
> Mesos, has problems dealing with archive files (.tar and .zip) that 
> contain UTF-8 characters. We run Marathon on top of Mesos, and one of our 
> Marathon applications relies on a .tar file that contains symlinks whose 
> target contains certain UTF-8 characters (Turkish) or whose name itself 
> contains UTF-8 characters. The Mesos fetcher is unable to extract the archive and 
> fails with the following error:
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 
> 10:47:30.791250  6136 fetcher.cpp:613] EXIT with status 1: Failed to fetch 
> '/tmp/certificates.tar.gz': Failed to extract archive 
> '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0/certificates.tar.gz'
>  to 
> '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0':
>  Failed to read archive header: Linkname can't be converted from UTF-8 to 
> current locale.}}
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]:}}
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: End 
> fetcher log for container 6a6e87e8-5eef-4e8e-8c00-3f081fa187b0}}
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 
> 10:47:30.846695  4343 fetcher.cpp:571] Failed to run mesos-fetcher: Failed to 
> fetch all URIs for container '6a6e87e8-5eef-4e8e-8c00-3f081fa187b0': exited 
> with status 1}}
> The same Marathon application works fine with Mesos 1.6 which does not use 
> libarchive.





[jira] [Commented] (MESOS-9804) Subprocess should close inherited file descriptors earlier.

2019-06-17 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865376#comment-16865376
 ] 

James Peach commented on MESOS-9804:


This is not to be fixed.

The current code doesn't close after the fork, but does mark the inherited 
descriptors {{CLOEXEC}}. If we close these instead, then it would be harder for 
subprocess hooks to pass a fd into the child and use it in a child hook, which 
is a legitimate and useful pattern. If we don't close it, then we have the same 
semantics as today. So I think that the current code works correctly.
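
For readers unfamiliar with the flag, here is a small POSIX-only Python sketch of what marking a descriptor {{CLOEXEC}} means. It is not the libprocess implementation: `set_cloexec` and `is_cloexec` are hypothetical helpers built on the standard `fcntl` module.

```python
import fcntl
import os


def set_cloexec(fd: int) -> None:
    """Mark `fd` close-on-exec: it survives fork() but is closed by exec()."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def is_cloexec(fd: int) -> bool:
    """Report whether the close-on-exec flag is set on `fd`."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


read_fd, write_fd = os.pipe()
os.set_inheritable(read_fd, True)  # Clear the flag, as for an inherited fd.
set_cloexec(read_fd)               # The fd stays usable until an exec().
```

Because the descriptor remains open between fork and exec, a child hook can still use it, which is the pattern the comment above wants to preserve.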

> Subprocess should close inherited file descriptors earlier.
> ---
>
> Key: MESOS-9804
> URL: https://issues.apache.org/jira/browse/MESOS-9804
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: James Peach
>Priority: Major
>
> The libprocess {{subprocess}} API doesn't close the file descriptors that are 
> inherited across fork until after applying the child hooks. This means that 
> the inherited descriptors can remain open for much longer than you expect, 
> since parent and child hooks both need to be scheduled and run.
> We should move the file descriptor closing as early as possible in the child. 
> We might also consider having the child write a byte back to the parent so 
> that we have a guaranteed synchronization point.





[jira] [Commented] (MESOS-9848) Blkio cgroup statistics files missing in Linux 5.1

2019-06-16 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865230#comment-16865230
 ] 

James Peach commented on MESOS-9848:


/cc [~jieyu] [~gilbert] [~qianzhang]

> Blkio cgroup statistics files missing in Linux 5.1
> --
>
> Key: MESOS-9848
> URL: https://issues.apache.org/jira/browse/MESOS-9848
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>
> In a recent Fedora release, the Linux blkio cgroup no longer publishes certain 
> stats files that the Mesos isolator expects to exist.
> In {{BlkioSubsystemProcess::usage}}, the isolator looks for
> * {{blkio.time}}
> * {{blkio.sectors}}
> * {{blkio.io_merged}}
> * {{blkio.io_queued}} 
> Here's the actual cgroup:
> {noformat}
> $ uname -r
> 5.1.8-300.fc30.x86_64
> ...
> [root@jpeach 184cf411-e73f-4c6e-bd54-8181222801af]# pwd
> /sys/fs/cgroup/blkio/mesos_test_c83596ce-76ff-47c8-b23d-1276c16e93ae/184cf411-e73f-4c6e-bd54-8181222801af
> [root@jpeach 184cf411-e73f-4c6e-bd54-8181222801af]# ls -l
> total 0
> -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_service_bytes
> -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_service_bytes_recursive
> -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_serviced
> -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_serviced_recursive
> -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.weight
> --w------- 1 root root 0 Jun 16 18:07 blkio.reset_stats
> -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_service_bytes
> -r--r--r-- 1 root root 0 Jun 16 18:07 
> blkio.throttle.io_service_bytes_recursive
> -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_serviced
> -r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_serviced_recursive
> -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.read_bps_device
> -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.read_iops_device
> -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.write_bps_device
> -rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.write_iops_device
> -rw-r--r-- 1 root root 0 Jun 16 18:07 cgroup.clone_children
> -rw-r--r-- 1 root root 0 Jun 16 18:06 cgroup.procs
> -rw-r--r-- 1 root root 0 Jun 16 18:07 notify_on_release
> {noformat}





[jira] [Created] (MESOS-9848) Blkio cgroup statistics files missing in Linux 5.1

2019-06-16 Thread James Peach (JIRA)
James Peach created MESOS-9848:
--

 Summary: Blkio cgroup statistics files missing in Linux 5.1
 Key: MESOS-9848
 URL: https://issues.apache.org/jira/browse/MESOS-9848
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach


In a recent Fedora release, the Linux blkio cgroup no longer publishes certain 
stats files that the Mesos isolator expects to exist.

In {{BlkioSubsystemProcess::usage}}, the isolator looks for
* {{blkio.time}}
* {{blkio.sectors}}
* {{blkio.io_merged}}
* {{blkio.io_queued}} 

Here's the actual cgroup:
{noformat}
$ uname -r
5.1.8-300.fc30.x86_64
...
[root@jpeach 184cf411-e73f-4c6e-bd54-8181222801af]# pwd
/sys/fs/cgroup/blkio/mesos_test_c83596ce-76ff-47c8-b23d-1276c16e93ae/184cf411-e73f-4c6e-bd54-8181222801af

[root@jpeach 184cf411-e73f-4c6e-bd54-8181222801af]# ls -l
total 0
-r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_service_bytes
-r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_service_bytes_recursive
-r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_serviced
-r--r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.io_serviced_recursive
-rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.bfq.weight
--w------- 1 root root 0 Jun 16 18:07 blkio.reset_stats
-r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_service_bytes
-r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_service_bytes_recursive
-r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_serviced
-r--r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.io_serviced_recursive
-rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.read_bps_device
-rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.read_iops_device
-rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.write_bps_device
-rw-r--r-- 1 root root 0 Jun 16 18:07 blkio.throttle.write_iops_device
-rw-r--r-- 1 root root 0 Jun 16 18:07 cgroup.clone_children
-rw-r--r-- 1 root root 0 Jun 16 18:06 cgroup.procs
-rw-r--r-- 1 root root 0 Jun 16 18:07 notify_on_release
{noformat}
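
A minimal sketch of how a reader of these statistics could tolerate the missing files, assuming the four file names listed above. The helper name {{read_available_stats}} and the skip-if-absent policy are illustrative assumptions, not the isolator's actual code.

```python
import os

# Statistics files named in the report above; absent on the 5.1 kernel shown.
EXPECTED_FILES = ["blkio.time", "blkio.sectors", "blkio.io_merged", "blkio.io_queued"]


def read_available_stats(cgroup_dir, names=EXPECTED_FILES):
    """Return ({name: contents}, [missing names]) for a blkio cgroup directory.

    Hypothetical helper: files that the kernel does not publish are reported
    as missing instead of being treated as a hard error.
    """
    found, missing = {}, []
    for name in names:
        path = os.path.join(cgroup_dir, name)
        try:
            with open(path) as f:
                found[name] = f.read()
        except FileNotFoundError:
            missing.append(name)
    return found, missing
```

The caller can then decide whether an empty result should be logged or reported as an error.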





[jira] [Assigned] (MESOS-9805) Run cgroup subsystems before moving the target PID.

2019-06-16 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9805:
--

  Resolution: Fixed
Assignee: James Peach
Target Version/s: 1.9.0

> Run cgroup subsystems before moving the target PID.
> ---
>
> Key: MESOS-9805
> URL: https://issues.apache.org/jira/browse/MESOS-9805
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> Currently, the PID targeted by the cgroups isolator is moved into the cgroup 
> before the subsystem runs to apply any type-specific cgroup configuration. We 
> should reverse the order of this so that the PID is only moved once the 
> cgroup is fully configured by the subsystem.
> The specific use case that affected us was where a PID was assigned to a 
> {{net_cls}} cgroup before that cgroup had the class ID set. This caused a 
> separate system to become confused.





[jira] [Commented] (MESOS-9800) libarchive cannot extract tarfile due to UTF-8 encoding issues

2019-06-04 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855625#comment-16855625
 ] 

James Peach commented on MESOS-9800:


We hit the same problem internally a while ago, and carried a patch to revert to 
using {{/usr/bin/tar}}. If you are building your own Mesos, try passing the 
{{\-\-with-libarchive}} flag to use the system library, which is likely to 
have been built with {{iconv}} support.

> libarchive cannot extract tarfile due to UTF-8 encoding issues
> --
>
> Key: MESOS-9800
> URL: https://issues.apache.org/jira/browse/MESOS-9800
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7.2
> Environment: Mesos 1.7.2 and Marathon 1.4.3 running on top of Ubuntu 
> 16.04.
>Reporter: Felipe Alfaro Solana
>Priority: Major
> Attachments: certificates2.tar.gz
>
>
> Starting with Mesos 1.7, the following change has been introduced:
>  * [MESOS-8064] - Mesos now requires libarchive to programmatically decode 
> .zip, .tar, .gzip, and other common file compression schemes. Version 3.3.2 
> is bundled in Mesos.
> However, this version of libarchive which is used by the fetcher component in 
> Mesos has problems in dealing with archive files (.tar and .zip) which 
> contain UTF-8 characters. We run Marathon on top of Mesos, and one of our 
> Marathon applications relies on a .tar file which contains symlinks whose 
> target contains certain UTF-8 characters (Turkish) or the symlink name itself 
> contains UTF-8 characters. Mesos fetcher is unable to extract the archive and 
> fails with the following error:
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 
> 10:47:30.791250  6136 fetcher.cpp:613] EXIT with status 1: Failed to fetch 
> '/tmp/certificates.tar.gz': Failed to extract archive 
> '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0/certificates.tar.gz'
>  to 
> '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0':
>  Failed to read archive header: Linkname can't be converted from UTF-8 to 
> current locale.}}
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]:}}
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: End 
> fetcher log for container 6a6e87e8-5eef-4e8e-8c00-3f081fa187b0}}
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 
> 10:47:30.846695  4343 fetcher.cpp:571] Failed to run mesos-fetcher: Failed to 
> fetch all URIs for container '6a6e87e8-5eef-4e8e-8c00-3f081fa187b0': exited 
> with status 1}}
> The same Marathon application works fine with Mesos 1.6 which does not use 
> libarchive.





[jira] [Commented] (MESOS-9805) Run cgroup subsystems before moving the target PID.

2019-05-30 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852665#comment-16852665
 ] 

James Peach commented on MESOS-9805:


/cc [~gilbert], [~jieyu] [~qianzhang]




> Run cgroup subsystems before moving the target PID.
> ---
>
> Key: MESOS-9805
> URL: https://issues.apache.org/jira/browse/MESOS-9805
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>
> Currently, the PID targeted by the cgroups isolator is moved into the cgroup 
> before the subsystem runs to apply any type-specific cgroup configuration. We 
> should reverse the order of this so that the PID is only moved once the 
> cgroup is fully configured by the subsystem.
> The specific use case that affected us was where a PID was assigned to a 
> {{net_cls}} cgroup before that cgroup had the class ID set. This caused a 
> separate system to become confused.





[jira] [Created] (MESOS-9805) Run cgroup subsystems before moving the target PID.

2019-05-30 Thread James Peach (JIRA)
James Peach created MESOS-9805:
--

 Summary: Run cgroup subsystems before moving the target PID.
 Key: MESOS-9805
 URL: https://issues.apache.org/jira/browse/MESOS-9805
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach


Currently, the PID targeted by the cgroups isolator is moved into the cgroup 
before the subsystem runs to apply any type-specific cgroup configuration. We 
should reverse the order of this so that the PID is only moved once the cgroup 
is fully configured by the subsystem.

The specific use case that affected us was where a PID was assigned to a 
{{net_cls}} cgroup before that cgroup had the class ID set. This caused a 
separate system to become confused.
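
The proposed ordering can be sketched as follows for the {{net_cls}} case. This is only an illustration under stated assumptions: the function name is hypothetical, and it writes the standard cgroup v1 control files {{net_cls.classid}} and {{cgroup.procs}} in whatever directory it is given.

```python
import os


def configure_then_assign(cgroup_dir, classid, pid):
    """Write the net_cls class ID first, and only then move the PID in.

    Hypothetical helper. Returns the control file names in the order they
    were written, so the ordering is easy to check.
    """
    steps = [("net_cls.classid", str(classid)), ("cgroup.procs", str(pid))]
    written = []
    for name, value in steps:
        with open(os.path.join(cgroup_dir, name), "w") as f:
            f.write(value)
        written.append(name)
    return written
```

With this order, an observer that sees the PID in {{cgroup.procs}} can rely on the class ID already being set.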





[jira] [Created] (MESOS-9804) Subprocess should close inherited file descriptors earlier

2019-05-30 Thread James Peach (JIRA)
James Peach created MESOS-9804:
--

 Summary: Subprocess should close inherited file descriptors earlier
 Key: MESOS-9804
 URL: https://issues.apache.org/jira/browse/MESOS-9804
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: James Peach


The libprocess {{subprocess}} API doesn't close the file descriptors that are 
inherited across fork until after applying the child hooks. This means that the 
inherited descriptors can remain open for much longer than you expect, since 
parent and child hooks both need to be scheduled and run.

We should move the file descriptor closing as early as possible in the child. 
We might also consider having the child write a byte back to the parent so that 
we have a guaranteed synchronization point.
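
A rough sketch of the idea, using plain {{fork}} and a pipe rather than the libprocess API. The function and parameter names are hypothetical. The child closes the inherited descriptors before anything else, writes one byte back to the parent, and only then runs its hook.

```python
import os


def spawn_with_sync(inherited_fds, child_hook):
    """Fork a child that closes `inherited_fds` first, then signals the parent.

    Returns (child pid, True if the sync byte was received). Hypothetical
    helper for illustration; the caller is expected to wait for the child.
    """
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: close inherited descriptors as early as possible.
        status = 1
        try:
            os.close(read_end)
            for fd in inherited_fds:
                os.close(fd)
            # Synchronization point: tell the parent the descriptors are closed.
            os.write(write_end, b"\0")
            os.close(write_end)
            child_hook()
            status = 0
        finally:
            os._exit(status)
    # Parent: block until the child reports that it closed the descriptors.
    os.close(write_end)
    byte = os.read(read_end, 1)
    os.close(read_end)
    return pid, byte == b"\0"
```

Once {{spawn_with_sync}} returns, the parent knows the child no longer holds the listed descriptors, regardless of how long the hook takes to run.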





[jira] [Commented] (MESOS-9799) Adopt container file operations in secrets volumes.

2019-05-28 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850489#comment-16850489
 ] 

James Peach commented on MESOS-9799:


| [r/70741|https://reviews.apache.org/r/70741] | Adopted container file 
operations for secrets volumes. |

> Adopt container file operations in secrets volumes.
> ---
>
> Key: MESOS-9799
> URL: https://issues.apache.org/jira/browse/MESOS-9799
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> Adopt containerized file operations in the secrets volume isolator so that it 
> doesn't have to use pre-exec commands.





[jira] [Assigned] (MESOS-9799) Adopt container file operations in secrets volumes.

2019-05-27 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9799:
--

Assignee: James Peach

> Adopt container file operations in secrets volumes.
> ---
>
> Key: MESOS-9799
> URL: https://issues.apache.org/jira/browse/MESOS-9799
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> Adopt containerized file operations in the secrets volume isolator so that it 
> doesn't have to use pre-exec commands.





[jira] [Created] (MESOS-9799) Adopt container file operations in secrets volumes.

2019-05-27 Thread James Peach (JIRA)
James Peach created MESOS-9799:
--

 Summary: Adopt container file operations in secrets volumes.
 Key: MESOS-9799
 URL: https://issues.apache.org/jira/browse/MESOS-9799
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach


Adopt containerized file operations in the secrets volume isolator so that it 
doesn't have to use pre-exec commands.





[jira] [Assigned] (MESOS-9769) Add direct containerized support for filesystem operations.

2019-05-22 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9769:
--

Assignee: James Peach

> Add direct containerized support for filesystem operations.
> ---
>
> Key: MESOS-9769
> URL: https://issues.apache.org/jira/browse/MESOS-9769
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> When setting up the container filesystems, we use `pre_exec_commands` to make 
> ABI symlinks and other things. The problem with this is that, depending on 
> the order of operations, we may not have the full security policy in place 
> yet, but since we are running in the context of the container's mount 
> namespaces, the programs we execute are under the control of whoever built 
> the container image.
> [~jieyu] and I previously discussed adding filesystem operations to the 
> `ContainerLaunchInfo`. Just `ln` would be sufficient for the `cgroups` and 
> `linux/filesystem` isolators. Secrets and port mapping isolators need more, 
> so we should discuss and file new tickets if necessary.





[jira] [Commented] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag

2019-05-20 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844438#comment-16844438
 ] 

James Peach commented on MESOS-9768:


{quote}
What we are primarily interested in is to set it for the overlay backend 
but there are multiple backend options. Seems like a common flag 
--image_mount_options could be applicable to bind backend as well (maybe aufs 
too? Gilbert Song). It doesn't apply to the copy backend of course.
{quote}

I think that the main mount option that applies to non-overlayfs backends is 
{{MS_RDONLY}}. Since you only get one image provisioner backend, I think that a 
single global option is OK. Each backend can error out if there are any mount 
options provided that it can't support.

Making this a per-container option is more complex. We can table the issue of 
mount flags for non-image volumes here, since I expect that the configuration 
for that will be different.
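
As a sketch of the "each backend errors out" idea: the table of which backend accepts which option is a hypothetical placeholder that only loosely follows the discussion above (overlay takes general mount options, bind only read-only, copy none). It is not a statement of what the real backends support.

```python
# Hypothetical table: which mount options each image backend could honour.
SUPPORTED_OPTIONS = {
    "overlay": {"ro", "nosuid", "nodev", "noexec"},
    "bind": {"ro"},
    "copy": set(),
}


def check_mount_options(backend, options):
    """Raise ValueError if `backend` cannot support any of `options`."""
    unsupported = sorted(set(options) - SUPPORTED_OPTIONS[backend])
    if unsupported:
        raise ValueError(
            "backend %r does not support mount options: %s"
            % (backend, ", ".join(unsupported)))
```

A single global flag would be validated once at startup against the one configured backend.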


> Allow operators to mount the container rootfs with the `nosuid` flag
> 
>
> Key: MESOS-9768
> URL: https://issues.apache.org/jira/browse/MESOS-9768
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>
> If cluster users are allowed to launch containers with arbitrary images, 
> those images may contain setuid programs. For security reasons (auditing, 
> privilege escalation), operators may wish to ensure that setuid programs 
> cannot be used within a container.
>  
> We should provide a way for operators to be able to specify that container 
> volumes (including `/`) should be mounted with the `nosuid` flag.





[jira] [Assigned] (MESOS-9771) Mask sensitive procfs paths.

2019-05-19 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9771:
--

Assignee: James Peach

| [r/70678|https://reviews.apache.org/r/70678] | Add containerizer support for 
masking paths. |

> Mask sensitive procfs paths.
> 
>
> Key: MESOS-9771
> URL: https://issues.apache.org/jira/browse/MESOS-9771
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> We already have a set of procfs paths that we mark read-only in the 
> containerizer, but there are additional paths that are considered sensitive 
> by other containerizers and are masked altogether:
> {noformat}
> "/proc/asound"
> "/proc/acpi"
> "/proc/kcore"
> "/proc/keys"
> "/proc/latency_stats"
> "/proc/timer_list"
> "/proc/timer_stats"
> "/proc/sched_debug"
> "/sys/firmware"
> "/proc/scsi"
> {noformat}
> Masking is done by mounting {{/dev/null}} on files, and an empty, readonly 
> {{tmpfs}} on directories.





[jira] [Commented] (MESOS-9771) Mask sensitive procfs paths.

2019-05-06 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834402#comment-16834402
 ] 

James Peach commented on MESOS-9771:


Since {{/proc/keys}} gets masked, we should probably mask {{/proc/key-users}} 
too. Weird that I don't see other containerizers doing that.

My main concern with this change is compatibility with containerized services 
like CSI, that may need privileged access to the host. Masking all these paths 
for this kind of service could break them.

There are a few possible solutions:
1. Skip the masking based on properties of the launch, e.g. whether the Docker 
{{privileged}} flag is set, or whether the container is joining the host's PID 
namespace.
2. Add a flag that specifies the set of paths to mask, so that operators can 
whack it with configuration.
3. Unconditionally do the masking.

If we go down the path of (2), then operators who need privileged containers to 
see this information will be stranded, so my preference would be something 
closer to (1).

If we prefer (3), then we already unconditionally make certain container paths 
read-only, which could be regarded as precedent.

/cc [~jieyu] [~gilbert] [~jasonlai]


> Mask sensitive procfs paths.
> 
>
> Key: MESOS-9771
> URL: https://issues.apache.org/jira/browse/MESOS-9771
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>
> We already have a set of procfs paths that we mark read-only in the 
> containerizer, but there are additional paths that are considered sensitive 
> by other containerizers and are masked altogether:
> {noformat}
> "/proc/asound"
> "/proc/acpi"
> "/proc/kcore"
> "/proc/keys"
> "/proc/latency_stats"
> "/proc/timer_list"
> "/proc/timer_stats"
> "/proc/sched_debug"
> "/sys/firmware"
> "/proc/scsi"
> {noformat}
> Masking is done by mounting {{/dev/null}} on files, and an empty, readonly 
> {{tmpfs}} on directories.





[jira] [Created] (MESOS-9771) Mask sensitive procfs paths.

2019-05-06 Thread James Peach (JIRA)
James Peach created MESOS-9771:
--

 Summary: Mask sensitive procfs paths.
 Key: MESOS-9771
 URL: https://issues.apache.org/jira/browse/MESOS-9771
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach


We already have a set of procfs paths that we mark read-only in the 
containerizer, but there are additional paths that are considered sensitive by 
other containerizers and are masked altogether:

{noformat}
"/proc/asound"
"/proc/acpi"
"/proc/kcore"
"/proc/keys"
"/proc/latency_stats"
"/proc/timer_list"
"/proc/timer_stats"
"/proc/sched_debug"
"/sys/firmware"
"/proc/scsi"
{noformat}

Masking is done by mounting {{/dev/null}} on files, and an empty, readonly 
{{tmpfs}} on directories.
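
The file-versus-directory rule can be sketched like this. The helper only builds the {{mount}} command line that would mask a path and executes nothing; its name and the choice to skip absent paths are assumptions for illustration.

```python
import os


def mask_command(path):
    """Return the mount(8) argv that would mask `path`, or None if it is absent.

    Files get /dev/null bind-mounted over them; directories get an empty,
    read-only tmpfs. Nothing is executed here.
    """
    if os.path.isdir(path):
        return ["mount", "-t", "tmpfs", "-o", "ro", "tmpfs", path]
    if os.path.exists(path):
        return ["mount", "--bind", "/dev/null", path]
    return None
```

Returning {{None}} for absent paths lets the same list be applied on kernels that do not provide every entry.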





[jira] [Commented] (MESOS-9770) Add no-new-privileges isolator

2019-05-06 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834398#comment-16834398
 ] 

James Peach commented on MESOS-9770:


/cc [~jieyu] [~gilbert] [~abudnik]

> Add no-new-privileges isolator
> --
>
> Key: MESOS-9770
> URL: https://issues.apache.org/jira/browse/MESOS-9770
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>
> To give security-minded operators more defense in depth, add a {{linux/nnp}} 
> isolator that sets the no-new-privileges bit before starting the executor.





[jira] [Created] (MESOS-9770) Add no-new-privileges isolator

2019-05-06 Thread James Peach (JIRA)
James Peach created MESOS-9770:
--

 Summary: Add no-new-privileges isolator
 Key: MESOS-9770
 URL: https://issues.apache.org/jira/browse/MESOS-9770
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach


To give security-minded operators more defense in depth, add a {{linux/nnp}} 
isolator that sets the no-new-privileges bit before starting the executor.
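
For reference, setting the bit amounts to a single {{prctl(2)}} call. The sketch below does it through {{ctypes}} on Linux; the wrapper names are hypothetical and this is not the isolator code. The constants are the values from {{<linux/prctl.h>}}.

```python
import ctypes
import os

# Values from <linux/prctl.h>.
PR_SET_NO_NEW_PRIVS = 38
PR_GET_NO_NEW_PRIVS = 39


def _prctl(option, arg2=0):
    """Call prctl(2) via libc and raise OSError on failure (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    result = libc.prctl(ctypes.c_int(option), ctypes.c_ulong(arg2),
                        ctypes.c_ulong(0), ctypes.c_ulong(0), ctypes.c_ulong(0))
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def set_no_new_privs():
    """Set the no-new-privileges bit on the calling thread (irreversible)."""
    _prctl(PR_SET_NO_NEW_PRIVS, 1)


def has_no_new_privs():
    """Return True if the no-new-privileges bit is set for the caller."""
    return _prctl(PR_GET_NO_NEW_PRIVS) == 1
```

The bit is inherited across {{fork}} and preserved across {{execve}}, so setting it just before starting the executor covers everything the executor launches.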





[jira] [Created] (MESOS-9769) Add direct containerized support for filesystem operations

2019-05-06 Thread James Peach (JIRA)
James Peach created MESOS-9769:
--

 Summary: Add direct containerized support for filesystem operations
 Key: MESOS-9769
 URL: https://issues.apache.org/jira/browse/MESOS-9769
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach


When setting up the container filesystems, we use `pre_exec_commands` to make 
ABI symlinks and other things. The problem with this is that, depending on the 
order of operations, we may not have the full security policy in place yet, but 
since we are running in the context of the container's mount namespaces, the 
programs we execute are under the control of whoever built the container image.

[~jieyu] and I previously discussed adding filesystem operations to the 
`ContainerLaunchInfo`. Just `ln` would be sufficient for the `cgroups` and 
`linux/filesystem` isolators. Secrets and port mapping isolators need more, so 
we should discuss and file new tickets if necessary.
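
To illustrate the difference from running {{ln}} out of the image, here is a sketch that creates the symlinks directly with a system call. The {{(target, link_path)}} tuple format and the function name are hypothetical and do not reflect the eventual {{ContainerLaunchInfo}} schema.

```python
import os


def apply_symlinks(rootfs, links):
    """Create each (target, link_path) symlink under `rootfs` directly.

    `link_path` is absolute inside the container. No program from the
    container image is executed.
    """
    for target, link_path in links:
        destination = os.path.join(rootfs, link_path.lstrip("/"))
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        os.symlink(target, destination)
```

A real implementation would also need to take care not to follow symlinks that already exist inside the image while resolving the destination.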





[jira] [Comment Edited] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag

2019-05-06 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834357#comment-16834357
 ] 

James Peach edited comment on MESOS-9768 at 5/7/19 3:56 AM:


/cc [~jieyu] [~gilbert]


was (Author: jamespeach):
/cc [~jieyu] @gilbert

> Allow operators to mount the container rootfs with the `nosuid` flag
> 
>
> Key: MESOS-9768
> URL: https://issues.apache.org/jira/browse/MESOS-9768
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>
> If cluster users are allowed to launch containers with arbitrary images, 
> those images may contain setuid programs. For security reasons (auditing, 
> privilege escalation), operators may wish to ensure that setuid programs 
> cannot be used within a container.
>  
> We should provide a way for operators to be able to specify that container 
> volumes (including `/`) should be mounted with the `nosuid` flag.





[jira] [Commented] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag

2019-05-06 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834357#comment-16834357
 ] 

James Peach commented on MESOS-9768:


/cc [~jieyu] @gilbert

> Allow operators to mount the container rootfs with the `nosuid` flag
> 
>
> Key: MESOS-9768
> URL: https://issues.apache.org/jira/browse/MESOS-9768
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>
> If cluster users are allowed to launch containers with arbitrary images, 
> those images may contain setuid programs. For security reasons (auditing, 
> privilege escalation), operators may wish to ensure that setuid programs 
> cannot be used within a container.
>  
> We should provide a way for operators to be able to specify that container 
> volumes (including `/`) should be mounted with the `nosuid` flag.





[jira] [Created] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag

2019-05-06 Thread James Peach (JIRA)
James Peach created MESOS-9768:
--

 Summary: Allow operators to mount the container rootfs with the 
`nosuid` flag
 Key: MESOS-9768
 URL: https://issues.apache.org/jira/browse/MESOS-9768
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James Peach


If cluster users are allowed to launch containers with arbitrary images, those 
images may contain setuid programs. For security reasons (auditing, privilege 
escalation), operators may wish to ensure that setuid programs cannot be used 
within a container.

 

We should provide a way for operators to be able to specify that container 
volumes (including `/`) should be mounted with the `nosuid` flag.





[jira] [Assigned] (MESOS-9349) Prevent ptracing of container management processes.

2018-12-20 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9349:
--

 Assignee: James Peach
 Priority: Minor  (was: Major)
Fix Version/s: 1.8.0
   Issue Type: Improvement  (was: Bug)

| [r/69615|https://reviews.apache.org/r/69615] | Disable containerizer ptrace 
attach. |

> Prevent ptracing of container management processes.
> ---
>
> Key: MESOS-9349
> URL: https://issues.apache.org/jira/browse/MESOS-9349
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, security
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
> Fix For: 1.8.0
>
>
> The container launcher and the built-in executors are (at least partially) 
> accessible to containerized user tasks. Since these processes may contain 
> secrets or hold privileged resources, we can increase the difficulty of 
> attacking them by preventing user tasks attaching to them with ptrace(2). 
> This amounts to calling `prctl(PR_SET_DUMPABLE, 0)`.
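
A small sketch of that call from Python via {{ctypes}} on Linux. The wrapper name is hypothetical and the constants are the values from {{<linux/prctl.h>}}; the actual change is in the review linked above.

```python
import ctypes
import os

# Values from <linux/prctl.h>.
PR_GET_DUMPABLE = 3
PR_SET_DUMPABLE = 4


def disable_ptrace_attach():
    """Clear the dumpable flag of the calling process and return its new value.

    A non-dumpable process cannot be attached to with ptrace(2) by an
    unprivileged tracer.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    if libc.prctl(ctypes.c_int(PR_SET_DUMPABLE), zero, zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # PR_GET_DUMPABLE returns the current flag as the call's result.
    return libc.prctl(ctypes.c_int(PR_GET_DUMPABLE), zero, zero, zero, zero)
```

Note that the kernel resets the flag on {{execve}} in some cases, so the call has to be made in the process that should be protected.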





[jira] [Commented] (MESOS-9319) Move root filesystem creation to the `filesystem/linux` isolator.

2018-11-26 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699772#comment-16699772
 ] 

James Peach commented on MESOS-9319:


Updated patch series:

| [r/69211|https://reviews.apache.org/r/69211] | Improved the code comments for 
`getContainerDevicesPath`. |
| [r/69210|https://reviews.apache.org/r/69210] | Used the MS_SILENT mount flag 
to elide unwanted logging. |
| [r/69086|https://reviews.apache.org/r/69086] | Moved the container root 
construction to the isolators. |
| [r/69450|https://reviews.apache.org/r/69450] | Applied the 
`ContainerMountInfo` protobuf helper. |

> Move root filesystem creation to the `filesystem/linux` isolator.
> -
>
> Key: MESOS-9319
> URL: https://issues.apache.org/jira/browse/MESOS-9319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> When using a custom user namespace isolator, the task fails at launch because 
> opening devices fails with an EPERM error. This problem is described in [this 
> systemd issue|https://github.com/systemd/systemd/pull/9483] and [this 
> lxd issue|https://github.com/lxc/lxd/issues/4950].
> The problem arises in the Mesos containerizer due to the order of operations:
> # Clone the containerizer with {{CLONE_NEWNS}}
> # Mount a tmpfs for the devices
> # mknod for the various device nodes
> Referring back to the lxc issue, because we do (1) before (2), the tmpfs on 
> {{/dev}} is marked {{SB_I_NODEV}}. Due to the new 4.18 behavior, the mknod in 
> (3) now succeeds (see commit 
> [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
>  Previously it would fail and we would fall back to bind mounting the device. 
> However, even though we created the device, we can't actually open it due to 
> the {{SB_I_NODEV}} flag on the tmpfs mount. It appears that the purpose of 
> allowing mknod is so that containers can create overlayfs whiteouts.
> One approach to deal with this in the Mesos containerizer is to complete the 
> device node cleanup that was begun with the linux/devices isolator. This 
> approach involves moving all the responsibility for creating devices back to 
> the isolators. Then, at containerization time, we simply bind-mount the whole 
> of /dev from the per-container staging area. Since the isolators create the 
> devices in the host namespace and on the Mesos work directory, none of the 
> conditions that trigger the failure would be invoked.
> The failure we observed with our tasks was a failure to open {{/dev/null}}, 
> when redirecting it as standard input to a child process.





[jira] [Assigned] (MESOS-9418) CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage fails on 4.19 kernels

2018-11-26 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9418:
--

Assignee: James Peach

> CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage fails on 4.19 kernels
> -
>
> Key: MESOS-9418
> URL: https://issues.apache.org/jira/browse/MESOS-9418
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, test
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> The {{CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage}} test fails on Linux 4.19 
> kernels.
> {noformat}
> [jpeach@jpeach mesos]$ uname -r
> 4.19.3-300.fc29.x86_64
> [jpeach@jpeach build]$ sudo env GLOG_v=1 ./src/mesos-tests --verbose 
> --gtest_filter=CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage
> ...
> W1126 10:45:44.941278 30021 cgroups.cpp:895] Skipping resource statistic for 
> container 8f67e5f9-ebf0-436c-a1d2-f30c69883a27 because: Failed to parse blkio 
> value '8:0 Discard 0' from 'blkio.io_service_bytes': Invalid major:minor 
> device number: 'Discard'
> ../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1890: Failure
> Value of: usage->has_blkio_statistics()
>   Actual: false
> Expected: true
> ../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1891: Failure
> Expected: (2) <= (usage->blkio_statistics().throttling_size()), actual: 2 vs 0
> ../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1902: Failure
> totalThrottling is NONE
> mesos-tests: ../../../3rdparty/stout/include/stout/option.hpp:119: T 
> ::get() & [T = 
> mesos::CgroupInfo_Blkio_Throttling_Statistics]: Assertion `isSome()' failed.
> ...
> {noformat}
> The actual cgroup format is:
> {noformat}
> [jpeach@jpeach blkio]$ pwd
> /sys/fs/cgroup/blkio
> [jpeach@jpeach blkio]$ cat 
> mesos_test_e9c8e0aa-3172-4d8d-b216-c8f5286a7efc/blkio.io_service_bytes
> 8:0 Read 0
> 8:0 Write 0
> 8:0 Sync 0
> 8:0 Async 0
> 8:0 Discard 0
> 8:0 Total 0
> Total 0
> {noformat}





[jira] [Created] (MESOS-9418) CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage fails on 4.19 kernels

2018-11-26 Thread James Peach (JIRA)
James Peach created MESOS-9418:
--

 Summary: CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage fails on 4.19 
kernels
 Key: MESOS-9418
 URL: https://issues.apache.org/jira/browse/MESOS-9418
 Project: Mesos
  Issue Type: Bug
  Components: containerization, test
Reporter: James Peach


The {{CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage}} test fails on Linux 4.19 
kernels.

{noformat}
[jpeach@jpeach mesos]$ uname -r
4.19.3-300.fc29.x86_64
[jpeach@jpeach build]$ sudo env GLOG_v=1 ./src/mesos-tests --verbose 
--gtest_filter=CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage
...
W1126 10:45:44.941278 30021 cgroups.cpp:895] Skipping resource statistic for 
container 8f67e5f9-ebf0-436c-a1d2-f30c69883a27 because: Failed to parse blkio 
value '8:0 Discard 0' from 'blkio.io_service_bytes': Invalid major:minor device 
number: 'Discard'
../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1890: Failure
Value of: usage->has_blkio_statistics()
  Actual: false
Expected: true
../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1891: Failure
Expected: (2) <= (usage->blkio_statistics().throttling_size()), actual: 2 vs 0
../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1902: Failure
totalThrottling is NONE
mesos-tests: ../../../3rdparty/stout/include/stout/option.hpp:119: T 
::get() & [T = 
mesos::CgroupInfo_Blkio_Throttling_Statistics]: Assertion `isSome()' failed.
...
{noformat}

The actual cgroup format is:
{noformat}
[jpeach@jpeach blkio]$ pwd
/sys/fs/cgroup/blkio
[jpeach@jpeach blkio]$ cat 
mesos_test_e9c8e0aa-3172-4d8d-b216-c8f5286a7efc/blkio.io_service_bytes
8:0 Read 0
8:0 Write 0
8:0 Sync 0
8:0 Async 0
8:0 Discard 0
8:0 Total 0
Total 0
{noformat}
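
For illustration, a parser that accepts this format without assuming a fixed set of operation names could look like the sketch below. It is not the Mesos parser; the function name and the returned structure are assumptions made for the example.

```python
def parse_blkio_stats(text):
    """Parse blkio cgroup statistics into ({device: {operation: value}}, total).

    Lines look like '8:0 Read 0'; the final 'Total 0' line has no device.
    Operation names are not checked against a fixed list, so an entry such as
    'Discard' is kept rather than rejected.
    """
    per_device = {}
    total = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) == 2 and fields[0] == "Total":
            total = int(fields[1])
        elif len(fields) == 3 and ":" in fields[0]:
            device, operation, value = fields
            per_device.setdefault(device, {})[operation] = int(value)
        else:
            raise ValueError("unexpected blkio line: %r" % line)
    return per_device, total
```

Fed the {{blkio.io_service_bytes}} contents shown above, this keeps the {{Discard}} entry for device {{8:0}} alongside the older operations.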






[jira] [Commented] (MESOS-9393) Fetcher crashes extracting archives with non-ASCII filenames.

2018-11-15 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688874#comment-16688874
 ] 

James Peach commented on MESOS-9393:


Probably need to ensure that we are building libarchive with {{--with-iconv}}.

> Fetcher crashes extracting archives with non-ASCII filenames.
> -
>
> Key: MESOS-9393
> URL: https://issues.apache.org/jira/browse/MESOS-9393
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Reporter: James Peach
>Priority: Critical
>
> {noformat}
> (gdb) bt
> #0  0x7f2ec3827925 in raise () from /lib64/libc.so.6
> #1  0x7f2ec3829105 in abort () from /lib64/libc.so.6
> #2  0x7f2ec3e5da5d in __gnu_cxx::__verbose_terminate_handler() () from 
> /usr/lib64/libstdc++.so.6
> #3  0x7f2ec3e5bbe6 in ?? () from /usr/lib64/libstdc++.so.6
> #4  0x7f2ec3e5bc13 in std::terminate() () from /usr/lib64/libstdc++.so.6
> #5  0x7f2ec3e5bd0e in __cxa_throw () from /usr/lib64/libstdc++.so.6
> #6  0x7f2ec3e00837 in std::__throw_logic_error(char const*) () from 
> /usr/lib64/libstdc++.so.6
> #7  0x7f2ec3e3be59 in ?? () from /usr/lib64/libstdc++.so.6
> #8  0x7f2ec3e3bf33 in std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(char const*, std::allocator<char> 
> const&) ()
>from /usr/lib64/libstdc++.so.6
> #9  0x555f5e843a6d in archiver::extract (source=...,
> 
> destination="/tmp/mesos/slaves/04f97156-23b7-4411-8fa7-bdec71518221-S1320/frameworks/156b4459-4bb6-460b-89e5-d8c583dee257-0413/executors/cstapper-test-service.simple-pod.test.0.ti9dgngkdceq2_0/runs/4a2a188e-54ef-4"...,
>  flags=) at ../../3rdparty/stout/include/stout/archiver.hpp:130
> #10 0x555f5e859f06 in extract 
> (sourcePath="/tmp/mesos/fetch/siri/c3-ace-inspector.tar.gz",
> 
> destinationDirectory="/tmp/mesos/slaves/04f97156-23b7-4411-8fa7-bdec71518221-S1320/frameworks/156b4459-4bb6-460b-89e5-d8c583dee257-0413/executors/cstapper-test-service.simple-pod.test.0.ti9dgngkdceq2_0/runs/4a2a188e-54ef-4"...)
>  at ../../src/launcher/fetcher.cpp:86
> {noformat}
> {noformat}
> (gdb) p (struct archive_string_conv 
> *)archive_string_conversion_to_charset(entry->archive, "UTF-8", 1)
> $1 = (struct archive_string_conv *) 0x7fe599cd2be0
> (gdb) p &entry->ae_pathname
> $2 = (struct archive_mstring *) 0x7fe599c48010
> (gdb) p (int)archive_strncpy_l(&($2->aes_utf8), $2->aes_mbs.s, 
> $2->aes_mbs.length, $1)
> $3 = -1
> {noformat}
> So archive_strncpy_l() fails with -1. best_effort_strncat_in_locale() has 
> this wonky-looking code:
> {noformat}
> 2235   remaining = length;
> 2236   itp = (const uint8_t *)_p;
> 2237   while (*itp && remaining > 0) {
> 2238 if (*itp > 127) {
> 2239   // Non-ASCII: Substitute with suitable replacement
> 2240   if (sc->flag & SCONV_TO_UTF8) {
> 2241 if (archive_string_append(as, utf8_replacement_char, 
> sizeof(utf8_replacement_char)) == NULL) {
> 2242   __archive_errx(1, "Out of memory");
> 2243 }
> 2244   } else {
> 2245 archive_strappend_char(as, '?');
> 2246   }
> 2247   return_value = -1;
> 2248 } else {
> 2249   archive_strappend_char(as, *itp);
> 2250 }
> 2251 ++itp;
> 2252   }
> (gdb) break best_effort_strncat_in_locale
> Breakpoint 2 at 0x56143c85ff70: file libarchive/archive_string.c, line 2213.
> (gdb) p (int)archive_strncpy_l(&($2->aes_utf8), $2->aes_mbs.s, 
> $2->aes_mbs.length, $1)
> ...
> (gdb)
> 2237  while (*itp && remaining > 0) {
> (gdb)
> 2238  if (*itp > 127) {
> (gdb)
> 2240  if (sc->flag & SCONV_TO_UTF8) {
> (gdb)
> 2241  if (archive_string_append(as, 
> utf8_replacement_char, sizeof(utf8_replacement_char)) == NULL) {
> (gdb)
> 2251  ++itp;
> (gdb)
> 2237  while (*itp && remaining > 0) {
> (gdb)
> 2247  return_value = -1;
> (gdb) p *itp
> $5 = 195 '\303'
> {noformat}





[jira] [Created] (MESOS-9393) Fetcher crashes extracting archives with non-ASCII filenames.

2018-11-15 Thread James Peach (JIRA)
James Peach created MESOS-9393:
--

 Summary: Fetcher crashes extracting archives with non-ASCII 
filenames.
 Key: MESOS-9393
 URL: https://issues.apache.org/jira/browse/MESOS-9393
 Project: Mesos
  Issue Type: Bug
  Components: fetcher
Reporter: James Peach


{noformat}
(gdb) bt
#0  0x7f2ec3827925 in raise () from /lib64/libc.so.6
#1  0x7f2ec3829105 in abort () from /lib64/libc.so.6
#2  0x7f2ec3e5da5d in __gnu_cxx::__verbose_terminate_handler() () from 
/usr/lib64/libstdc++.so.6
#3  0x7f2ec3e5bbe6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x7f2ec3e5bc13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x7f2ec3e5bd0e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x7f2ec3e00837 in std::__throw_logic_error(char const*) () from 
/usr/lib64/libstdc++.so.6
#7  0x7f2ec3e3be59 in ?? () from /usr/lib64/libstdc++.so.6
#8  0x7f2ec3e3bf33 in std::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) 
()
   from /usr/lib64/libstdc++.so.6
#9  0x555f5e843a6d in archiver::extract (source=...,

destination="/tmp/mesos/slaves/04f97156-23b7-4411-8fa7-bdec71518221-S1320/frameworks/156b4459-4bb6-460b-89e5-d8c583dee257-0413/executors/cstapper-test-service.simple-pod.test.0.ti9dgngkdceq2_0/runs/4a2a188e-54ef-4"...,
 flags=) at ../../3rdparty/stout/include/stout/archiver.hpp:130
#10 0x555f5e859f06 in extract 
(sourcePath="/tmp/mesos/fetch/siri/c3-ace-inspector.tar.gz",

destinationDirectory="/tmp/mesos/slaves/04f97156-23b7-4411-8fa7-bdec71518221-S1320/frameworks/156b4459-4bb6-460b-89e5-d8c583dee257-0413/executors/cstapper-test-service.simple-pod.test.0.ti9dgngkdceq2_0/runs/4a2a188e-54ef-4"...)
 at ../../src/launcher/fetcher.cpp:86
{noformat}

{noformat}
(gdb) p (struct archive_string_conv 
*)archive_string_conversion_to_charset(entry->archive, "UTF-8", 1)
$1 = (struct archive_string_conv *) 0x7fe599cd2be0
(gdb) p &entry->ae_pathname
$2 = (struct archive_mstring *) 0x7fe599c48010
(gdb) p (int)archive_strncpy_l(&($2->aes_utf8), $2->aes_mbs.s, 
$2->aes_mbs.length, $1)
$3 = -1
{noformat}

So archive_strncpy_l() fails with -1. best_effort_strncat_in_locale() has this 
wonky-looking code:

{noformat}
2235   remaining = length;
2236   itp = (const uint8_t *)_p;
2237   while (*itp && remaining > 0) {
2238 if (*itp > 127) {
2239   // Non-ASCII: Substitute with suitable replacement
2240   if (sc->flag & SCONV_TO_UTF8) {
2241 if (archive_string_append(as, utf8_replacement_char, 
sizeof(utf8_replacement_char)) == NULL) {
2242   __archive_errx(1, "Out of memory");
2243 }
2244   } else {
2245 archive_strappend_char(as, '?');
2246   }
2247   return_value = -1;
2248 } else {
2249   archive_strappend_char(as, *itp);
2250 }
2251 ++itp;
2252   }

(gdb) break best_effort_strncat_in_locale
Breakpoint 2 at 0x56143c85ff70: file libarchive/archive_string.c, line 2213.
(gdb) p (int)archive_strncpy_l(&($2->aes_utf8), $2->aes_mbs.s, 
$2->aes_mbs.length, $1)
...

(gdb)
2237  while (*itp && remaining > 0) {
(gdb)
2238  if (*itp > 127) {
(gdb)
2240  if (sc->flag & SCONV_TO_UTF8) {
(gdb)
2241  if (archive_string_append(as, 
utf8_replacement_char, sizeof(utf8_replacement_char)) == NULL) {
(gdb)
2251  ++itp;
(gdb)
2237  while (*itp && remaining > 0) {
(gdb)
2247  return_value = -1;
(gdb) p *itp
$5 = 195 '\303'
{noformat}
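
Frames #6 through #8 of the backtrace show the {{std::string(const char*)}} constructor throwing {{std::logic_error}}, which is what libstdc++ does when it is handed a null pointer. Presumably the pathname accessor returned NULL once the character set conversion failed. A minimal sketch of guarding that construction follows; {{copyCString}} is a hypothetical helper, not something that exists in stout or libarchive.

```cpp
#include <string>

// Hypothetical helper: copies a possibly-null C string into `out`.
// Returns false rather than constructing a std::string from a null
// pointer, so the caller can report an error instead of aborting.
bool copyCString(const char* s, std::string* out) {
  if (s == nullptr) {
    return false;
  }
  *out = s;
  return true;
}
```

With a guard like this the extraction could fail with a readable error for the offending entry rather than terminating the fetcher.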





[jira] [Created] (MESOS-9367) GetContainers call crashes when using XFS disk isolation.

2018-11-01 Thread James Peach (JIRA)
James Peach created MESOS-9367:
--

 Summary: GetContainers call crashes when using XFS disk isolation.
 Key: MESOS-9367
 URL: https://issues.apache.org/jira/browse/MESOS-9367
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: James Peach
Assignee: James Peach


Here's the check failure:
{noformat}
F1031 20:30:33.246723 3435208 evolve.cpp:736] Check failed: 
'::protobuf::parse(resource_statistics.get())' Must be 
SOME: Missing required fields: disk_statistics[0].source.type
{noformat}

The JSON that is being rendered into protobufs is:
{noformat}
  "disk_statistics": [
{
  "limit_bytes": 41943040,
  "persistence": {
"id": "7461819b-b0bf-42fc-aa9e-f9958c545523",
"principal": "jarvis-principal"
  },
  "source": {},
  "used_bytes": 25006080
}
  ],
{noformat}

Note the empty "source" element, which triggers the protobuf conversion failure.





[jira] [Commented] (MESOS-9319) Move root filesystem creation to the `filesystem/linux` isolator.

2018-10-29 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667915#comment-16667915
 ] 

James Peach commented on MESOS-9319:


Retitling, based on a slightly expanded scope from review feedback. Rather than 
just building /dev in the Linux filesystem isolator, we are going to build the 
whole root filesystem.

| [r/69086|https://reviews.apache.org/r/69086] | Moved container root 
construction to the isolators. |
| [r/69211|https://reviews.apache.org/r/69211] | Improved the code comments for 
`getContainerDevicesPath`. |
| [r/69210|https://reviews.apache.org/r/69210] | Used the MS_SILENT mount flag 
to elide unwanted logging. |

> Move root filesystem creation to the `filesystem/linux` isolator.
> -
>
> Key: MESOS-9319
> URL: https://issues.apache.org/jira/browse/MESOS-9319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> When using a custom user namespace isolator, the task fails at launch because 
> opening devices fails with an EPERM error. This problem is described in [this 
> systemd issue|https://github.com/systemd/systemd/pull/9483] and [this 
> lxd|https://github.com/lxc/lxd/issues/4950] issue.
> The problem arises in the Mesos containerizer due to the order of operations:
> # Clone the containerizer with {{CLONE_NEWNS}}
> # Mount a tmpfs for the devices
> # mknod for the various device nodes
> Referring back to the lxc issue, because we do (1) before (2), the tmpfs on 
> {{/dev}} is marked {{SB_I_NODEV}}. Due to the new 4.18 behavior, the mknod in 
> (3) now succeeds (see commit 
> [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
>  Previously it would fail and we would fall back to bind mounting the device. 
> However, even though we created the device, we can't actually open it due to 
> the {{SB_I_NODEV}} flag on the tmpfs mount. It appears that the purpose of 
> allowing mknod is so that containers can create overlayfs whiteouts.
> One approach to deal with this in the Mesos containerizer is to complete the 
> device node cleanup that was begun with the linux/devices isolator. This 
> approach involves moving all the responsibility for creating devices back to 
> the isolators. Then, at containerization time, we simply bind-mount the whole 
> of /dev from the per-container staging area. Since the isolators create the 
> devices in the host namespace and on the Mesos work directory, none of the 
> conditions that trigger the failure would be invoked.
> The failure we observed with our tasks was a failure to open {{/dev/null}}, 
> when redirecting it as standard input to a child process.





[jira] [Created] (MESOS-9361) CgroupsIsolatorTest.ROOT_CGROUPS_CreateRecursively always fails.

2018-10-29 Thread James Peach (JIRA)
James Peach created MESOS-9361:
--

 Summary: CgroupsIsolatorTest.ROOT_CGROUPS_CreateRecursively always 
fails.
 Key: MESOS-9361
 URL: https://issues.apache.org/jira/browse/MESOS-9361
 Project: Mesos
  Issue Type: Bug
  Components: flaky, test
Reporter: James Peach


On Fedora 28:

{noformat}
[ RUN  ] CgroupsIsolatorTest.ROOT_CGROUPS_CreateRecursively
I1029 09:38:31.866564 31397 cgroups.cpp:2838] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos_test_62e0c540-832e-4601-8658-7faa25c427ce
I1029 09:38:31.867048 31398 cgroups.cpp:1229] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos_test_62e0c540-832e-4601-8658-7faa25c427ce after 
359936ns
I1029 09:38:31.869033 31397 cgroups.cpp:2856] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos_test_62e0c540-832e-4601-8658-7faa25c427ce
I1029 09:38:31.869357 31403 cgroups.cpp:1258] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos_test_62e0c540-832e-4601-8658-7faa25c427ce after 
261888ns
I1029 09:38:31.884752 31382 cluster.cpp:173] Creating default 'local' authorizer
I1029 09:38:31.892966 31397 master.cpp:413] Master 
0b04a175-fe62-41a1-a387-8d679d1d9609 (jpeach.scv.apple.com) started on 
17.228.8.72:42153
I1029 09:38:31.892992 31397 master.cpp:416] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/mFB69h/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/mFB69h/master" --zk_session_timeout="10secs"
I1029 09:38:31.893931 31397 master.cpp:465] Master only allowing authenticated 
frameworks to register
I1029 09:38:31.893942 31397 master.cpp:471] Master only allowing authenticated 
agents to register
I1029 09:38:31.893951 31397 master.cpp:477] Master only allowing authenticated 
HTTP frameworks to register
I1029 09:38:31.893962 31397 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/mFB69h/credentials'
I1029 09:38:31.894204 31397 master.cpp:521] Using default 'crammd5' 
authenticator
I1029 09:38:31.894359 31397 authenticator.cpp:520] Initializing server SASL
I1029 09:38:31.898878 31397 auxprop.cpp:73] Initialized in-memory auxiliary 
property plugin
I1029 09:38:31.898983 31397 http.cpp:1038] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1029 09:38:31.899279 31397 http.cpp:1038] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1029 09:38:31.899395 31397 http.cpp:1038] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1029 09:38:31.899507 31397 master.cpp:602] Authorization enabled
I1029 09:38:31.900339 31406 whitelist_watcher.cpp:77] No whitelist given
I1029 09:38:31.900434 31400 hierarchical.cpp:175] Initialized hierarchical 
allocator process
I1029 09:38:31.908254 31403 master.cpp:2105] Elected as the leading master!
I1029 09:38:31.908313 31403 master.cpp:1660] Recovering from registrar
I1029 09:38:31.908717 31404 registrar.cpp:339] Recovering registrar
I1029 09:38:31.910310 31400 registrar.cpp:383] Successfully fetched the 
registry (0B) in 1.547776ms
I1029 09:38:31.910684 31400 registrar.cpp:487] Applied 1 operations in 
150793ns; attempting to update the registry
I1029 09:38:31.913811 31400 registrar.cpp:544] Successfully updated the 
registry in 2.979072ms
I1029 09:38:31.914028 31400 registrar.cpp:416] Successfully recovered registrar
I1029 09:38:31.914872 31398 master.cpp:1774] Recovered 0 agents from the 
registry (154B); allowing 10mins for agents to reregister
I1029 09:38:31.914912 31406 hierarchical.cpp:215] Skipping recovery of 
hierarchical allocator: nothing to recover
I1029 09:38:31.920753 31382 

[jira] [Assigned] (MESOS-9354) Automatically remount read-only bind mounts.

2018-10-24 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9354:
--

Assignee: James Peach

> Automatically remount read-only bind mounts.
> 
>
> Key: MESOS-9354
> URL: https://issues.apache.org/jira/browse/MESOS-9354
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> To make a bind mount read-only, you have to first make the bind mount, then 
> remount it with the read-only flag. This is a bit arcane, which is why 
> mount(8) does it automatically. We should also do it automatically in 
> {{fs::mount}} so that every caller doesn't have to carry special code to make 
> it work correctly.





[jira] [Commented] (MESOS-9354) Automatically remount read-only bind mounts.

2018-10-24 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662995#comment-16662995
 ] 

James Peach commented on MESOS-9354:


/cc [~jieyu]

> Automatically remount read-only bind mounts.
> 
>
> Key: MESOS-9354
> URL: https://issues.apache.org/jira/browse/MESOS-9354
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: James Peach
>Priority: Minor
>
> To make a bind mount read-only, you have to first make the bind mount, then 
> remount it with the read-only flag. This is a bit arcane, which is why 
> mount(8) does it automatically. We should also do it automatically in 
> {{fs::mount}} so that every caller doesn't have to carry special code to make 
> it work correctly.





[jira] [Created] (MESOS-9354) Automatically remount read-only bind mounts.

2018-10-24 Thread James Peach (JIRA)
James Peach created MESOS-9354:
--

 Summary: Automatically remount read-only bind mounts.
 Key: MESOS-9354
 URL: https://issues.apache.org/jira/browse/MESOS-9354
 Project: Mesos
  Issue Type: Bug
  Components: agent, containerization
Reporter: James Peach


To make a bind mount read-only, you have to first make the bind mount, then 
remount it with the read-only flag. This is a bit arcane, which is why mount(8) 
does it automatically. We should also do it automatically in {{fs::mount}} so 
that every caller doesn't have to carry special code to make it work correctly.
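
As a rough sketch of the two-step sequence described above: the first mount(2) call creates the bind mount, and a second call with {{MS_REMOUNT}} makes it read-only. This is not the actual {{fs::mount}}; {{bindMount}} and {{READ_ONLY_REMOUNT_FLAGS}} are hypothetical names used only for this illustration.

```cpp
#include <sys/mount.h>

#include <cerrno>
#include <string>

// Flags for the second mount(2) call, which flips an existing bind mount
// to read-only.
const unsigned long READ_ONLY_REMOUNT_FLAGS = MS_BIND | MS_REMOUNT | MS_RDONLY;

// Hypothetical helper, not the actual fs::mount. Returns 0 on success, or
// the errno of the mount(2) call that failed. If the remount fails, the
// read-write bind mount from the first call is left in place.
int bindMount(
    const std::string& source,
    const std::string& target,
    bool readOnly) {
  // Step 1: create the bind mount itself.
  if (::mount(source.c_str(), target.c_str(), nullptr, MS_BIND, nullptr) != 0) {
    return errno;
  }

  // Step 2: remount the new bind mount with the read-only flag.
  if (readOnly &&
      ::mount(nullptr, target.c_str(), nullptr,
              READ_ONLY_REMOUNT_FLAGS, nullptr) != 0) {
    return errno;
  }

  return 0;
}
```

Folding the second call into one helper is the point of the proposal: callers pass a single read-only request and never see the remount.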





[jira] [Commented] (MESOS-9349) Prevent ptracing of container management processes.

2018-10-23 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661330#comment-16661330
 ] 

James Peach commented on MESOS-9349:


The plan here is to add an agent flag for operator visibility (probably the 
default should be enabled, so we improve security by default). We can examine 
the flag in the linux launcher, but from then on we can just sample and 
propagate the current setting.

> Prevent ptracing of container management processes.
> ---
>
> Key: MESOS-9349
> URL: https://issues.apache.org/jira/browse/MESOS-9349
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, security
>Reporter: James Peach
>Priority: Major
>
> The container launcher and the built-in executors are (at least partially) 
> accessible to containerized user tasks. Since these processes may contain 
> secrets or hold privileged resources, we can increase the difficulty of 
> attacking them by preventing user tasks attaching to them with ptrace(2). 
> This amounts to calling `prctl(PR_SET_DUMPABLE, 0)`.





[jira] [Created] (MESOS-9349) Prevent ptracing of container management processes.

2018-10-23 Thread James Peach (JIRA)
James Peach created MESOS-9349:
--

 Summary: Prevent ptracing of container management processes.
 Key: MESOS-9349
 URL: https://issues.apache.org/jira/browse/MESOS-9349
 Project: Mesos
  Issue Type: Bug
  Components: containerization, security
Reporter: James Peach


The container launcher and the built-in executors are (at least partially) 
accessible to containerized user tasks. Since these processes may contain 
secrets or hold privileged resources, we can increase the difficulty of 
attacking them by preventing user tasks attaching to them with ptrace(2). This 
amounts to calling `prctl(PR_SET_DUMPABLE, 0)`.
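
A minimal sketch of that call, assuming Linux and {{<sys/prctl.h>}}. The wrapper names {{disableDumpable}} and {{isDumpable}} are hypothetical and only serve the illustration; where Mesos would actually make the call is not decided here.

```cpp
#include <sys/prctl.h>

#include <cerrno>

// Clears the calling process's "dumpable" attribute.
// Returns 0 on success, or errno on failure.
int disableDumpable() {
  return ::prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) == 0 ? 0 : errno;
}

// Reports whether the calling process is currently dumpable.
bool isDumpable() {
  return ::prctl(PR_GET_DUMPABLE, 0, 0, 0, 0) == 1;
}
```

Reading the attribute back with {{PR_GET_DUMPABLE}} is one way a process could sample the current setting before deciding whether to apply it again.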





[jira] [Commented] (MESOS-9348) URL-encoded HDFS artifacts can't be fetched through the cache.

2018-10-22 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659865#comment-16659865
 ] 

James Peach commented on MESOS-9348:


One approach here is to URL-encode the output filename for the HDFS command. 
Experimentally, it looks like this is required, since the command errors out on 
unsafe characters:

{noformat}
# hdfs dfs -copyToLocal 
hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar
 $(pwd)/%255B-jpeach-].jar
copyToLocal: unexpected URISyntaxException
{noformat}
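
A minimal sketch of the encoding step, assuming the output filename is percent-encoded once more before it is handed to the command, so that the decode performed by {{hdfs dfs}} yields the name the fetcher expects. {{percentEncode}} is a hypothetical helper written for this illustration, and the set of characters left unescaped (the RFC 3986 unreserved set) is an assumption rather than something verified against the HDFS client.

```cpp
#include <string>

// Percent-encodes every byte outside the RFC 3986 "unreserved" set
// (letters, digits, '-', '.', '_', '~'). Note that '%' itself is encoded,
// so an already-encoded name gains one extra level of encoding.
std::string percentEncode(const std::string& s) {
  static const char hex[] = "0123456789ABCDEF";

  std::string out;
  for (unsigned char c : s) {
    const bool unreserved =
      (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
      (c >= '0' && c <= '9') ||
      c == '-' || c == '.' || c == '_' || c == '~';

    if (unreserved) {
      out += static_cast<char>(c);
    } else {
      out += '%';
      out += hex[c >> 4];
      out += hex[c & 0x0F];
    }
  }
  return out;
}
```

For a cache filename of {{%5B-jpeach-%5D.jar}} this produces {{%255B-jpeach-%255D.jar}}, which a single URI decode turns back into the original name.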

> URL-encoded HDFS artifacts can't be fetched through the cache.
> --
>
> Key: MESOS-9348
> URL: https://issues.apache.org/jira/browse/MESOS-9348
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Reporter: James Peach
>Priority: Major
>
> The {{hdfs dfs}} command always does a URI decode on the target output file. 
> This means that the output file gets stored in the fetcher cache under the 
> wrong filename and we can never retrieve it.
> Here's an example of how the command behaves:
> {noformat}
> [/tmp]# hdfs dfs -copyToLocal 
> hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar
>  $(pwd)/%5B-jpeach-%5D.jar
> [/tmp]# ls -l *jpeach*
> -rw-r--r-- 1 root root 7285799 Oct 22 23:29 [-jpeach-].jar
> {noformat}
> Here's how this plays out in the fetcher:
> {noformat}
> W1022 23:22:13.649587 3186459 fetcher.cpp:395] Copying instead of extracting 
> resource from URI with 'extract' flag, because it does not seem to be an 
> archive: 
> hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar
> cp: cannot stat `/srv/mesos/fetch/jarvis/c67-connector-_ASE%5D.jar': No such 
> file or directory
> E1022 23:22:13.652987 3186459 fetcher.cpp:613] EXIT with status 1: Failed to 
> fetch 
> 'hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar':
>  cp failed with status: 256
> ...
> # ls -latr /srv/mesos/fetch
> ...
> -rw-r--r-- 1 jarvis jarvis   7285799 Oct 22 23:22 c67-connector-_ASE].jar
> {noformat}
> The fetcher has downloaded the artifact into the cache, but can't copy it 
> into the sandbox because it was downloaded to the wrong filename.





[jira] [Created] (MESOS-9348) URL-encoded HDFS artifacts can't be fetched through the cache.

2018-10-22 Thread James Peach (JIRA)
James Peach created MESOS-9348:
--

 Summary: URL-encoded HDFS artifacts can't be fetched through the 
cache.
 Key: MESOS-9348
 URL: https://issues.apache.org/jira/browse/MESOS-9348
 Project: Mesos
  Issue Type: Bug
  Components: fetcher
Reporter: James Peach


The {{hdfs dfs}} command always does a URI decode on the target output file. 
This means that the output file gets stored in the fetcher cache under the 
wrong filename and we can never retrieve it.

Here's an example of how the command behaves:
{noformat}
[/tmp]# hdfs dfs -copyToLocal 
hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar
 $(pwd)/%5B-jpeach-%5D.jar

[/tmp]# ls -l *jpeach*
-rw-r--r-- 1 root root 7285799 Oct 22 23:29 [-jpeach-].jar
{noformat}

Here's how this plays out in the fetcher:
{noformat}
W1022 23:22:13.649587 3186459 fetcher.cpp:395] Copying instead of extracting 
resource from URI with 'extract' flag, because it does not seem to be an 
archive: 
hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar
cp: cannot stat `/srv/mesos/fetch/jarvis/c67-connector-_ASE%5D.jar': No such 
file or directory
E1022 23:22:13.652987 3186459 fetcher.cpp:613] EXIT with status 1: Failed to 
fetch 
'hdfs:///artifacts/8c/99/4b/8c994b489674589a58805e2e695e98674b9dd793411579f0fbaea3459f94f86e/connector/%5BRELEASE%5D/connector-%5BRELEASE%5D.jar':
 cp failed with status: 256
...
# ls -latr /srv/mesos/fetch
...
-rw-r--r-- 1 jarvis jarvis   7285799 Oct 22 23:22 c67-connector-_ASE].jar
{noformat}

The fetcher has downloaded the artifact into the cache, but can't copy it into 
the sandbox because it was downloaded to the wrong filename.





[jira] [Commented] (MESOS-9319) Create all container devices at isolation time.

2018-10-16 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652677#comment-16652677
 ] 

James Peach commented on MESOS-9319:


Prototype code looks promising. Currently, /dev is a tmpfs, but in this 
proposal it would be a bind mount to a real filesystem. I'm binding it in 
read-only to prevent disk quota escapes, which seems to work OK.

> Create all container devices at isolation time.
> ---
>
> Key: MESOS-9319
> URL: https://issues.apache.org/jira/browse/MESOS-9319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> When using a custom user namespace isolator, the task fails at launch because 
> opening devices fails with an EPERM error. This problem is described in [this 
> systemd issue|https://github.com/systemd/systemd/pull/9483] and [this 
> lxd|https://github.com/lxc/lxd/issues/4950] issue.
> The problem arises in the Mesos containerizer due to the order of operations:
> # Clone the containerizer with {{CLONE_NEWNS}}
> # Mount a tmpfs for the devices
> # mknod for the various device nodes
> Referring back to the lxc issue, because we do (1) before (2), the tmpfs on 
> {{/dev}} is marked {{SB_I_NODEV}}. Due to the new 4.18 behavior, the mknod in 
> (3) now succeeds (see commit 
> [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
>  Previously it would fail and we would fall back to bind mounting the device. 
> However, even though we created the device, we can't actually open it due to 
> the {{SB_I_NODEV}} flag on the tmpfs mount. It appears that the purpose of 
> allowing mknod is so that containers can create overlayfs whiteouts.
> One approach to deal with this in the Mesos containerizer is to complete the 
> device node cleanup that was begun with the linux/devices isolator. This 
> approach involves moving all the responsibility for creating devices back to 
> the isolators. Then, at containerization time, we simply bind-mount the whole 
> of /dev from the per-container staging area. Since the isolators create the 
> devices in the host namespace and on the Mesos work directory, none of the 
> conditions that trigger the failure would be invoked.
> The failure we observed with our tasks was a failure to open {{/dev/null}}, 
> when redirecting it as standard input to a child process.





[jira] [Commented] (MESOS-9302) Mesos fails to build on Fedora 28

2018-10-16 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651914#comment-16651914
 ] 

James Peach commented on MESOS-9302:


Upstream cares fix is [#209|https://github.com/c-ares/c-ares/pull/209]

> Mesos fails to build on Fedora 28
> -
>
> Key: MESOS-9302
> URL: https://issues.apache.org/jira/browse/MESOS-9302
> Project: Mesos
>  Issue Type: Bug
> Environment: gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
> Fedora 28
>Reporter: Benno Evers
>Priority: Major
>  Labels: build-failure
>
> Trying to compile a fresh Mesos checkout on a Fedora 28 system with the 
> following configuration flags:
> {noformat}
> ../configure --enable-debug --enable-optimize --disable-java --disable-python 
> --disable-libtool-wrappers --enable-ssl --enable-libevent --disable-werror
> {noformat}
> and the following compiler
> {noformat}
> [bev...@core1.hw.ca1 build]$ gcc --version
> gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
> Copyright (C) 2018 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> {noformat}
> fails the build due to two warnings (even though --disable-werror was passed):
> {noformat}
> make[4]: Entering directory '/home/bevers/mesos/build/3rdparty/grpc-1.10.0'
> [C]   Compiling third_party/cares/cares/ares_init.c
> third_party/cares/cares/ares_init.c: In function ‘ares_dup’:
> third_party/cares/cares/ares_init.c:301:17: error: argument to ‘sizeof’ in 
> ‘strncpy’ call is the same expression as the source; did you mean to use the 
> size of the destination? [-Werror=sizeof-pointer-memaccess]
>sizeof(src->local_dev_name));
>  ^
> third_party/cares/cares/ares_init.c: At top level:
> cc1: error: unrecognized command line option ‘-Wno-invalid-source-encoding’ 
> [-Werror]
> cc1: all warnings being treated as errors
> make[4]: *** [Makefile:2635: 
> /home/bevers/mesos/build/3rdparty/grpc-1.10.0/objs/opt/third_party/cares/cares/ares_init.o]
>  Error 1
> {noformat}
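
For readers unfamiliar with {{-Wsizeof-pointer-memaccess}}: GCC 8 flags a {{strncpy}} call whose length argument is {{sizeof}} of the source expression. The sketch below shows the usual way to write such a copy so that the bound comes from the destination. It is not the upstream c-ares patch; the {{Channel}} struct, its 32-byte buffer and {{copyDeviceName}} are hypothetical, and only the field name {{local_dev_name}} is taken from the error message.

```cpp
#include <cstring>

// Hypothetical stand-in for the structure holding the device name.
struct Channel {
  char local_dev_name[32];
};

// Copies the device name, bounding the copy by the size of the
// destination buffer and always NUL-terminating the result.
void copyDeviceName(Channel* dest, const Channel* src) {
  std::strncpy(
      dest->local_dev_name,
      src->local_dev_name,
      sizeof(dest->local_dev_name) - 1);
  dest->local_dev_name[sizeof(dest->local_dev_name) - 1] = '\0';
}
```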





[jira] [Comment Edited] (MESOS-9302) Mesos fails to build on Fedora 28

2018-10-16 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651914#comment-16651914
 ] 

James Peach edited comment on MESOS-9302 at 10/16/18 3:21 PM:
--

Upstream c-ares fix is [#209|https://github.com/c-ares/c-ares/pull/209]


was (Author: jamespeach):
Upstream cares fix is [#209|https://github.com/c-ares/c-ares/pull/209]

> Mesos fails to build on Fedora 28
> -
>
> Key: MESOS-9302
> URL: https://issues.apache.org/jira/browse/MESOS-9302
> Project: Mesos
>  Issue Type: Bug
> Environment: gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
> Fedora 28
>Reporter: Benno Evers
>Priority: Major
>  Labels: build-failure
>
> Trying to compile a fresh Mesos checkout on a Fedora 28 system with the 
> following configuration flags:
> {noformat}
> ../configure --enable-debug --enable-optimize --disable-java --disable-python 
> --disable-libtool-wrappers --enable-ssl --enable-libevent --disable-werror
> {noformat}
> and the following compiler
> {noformat}
> [bev...@core1.hw.ca1 build]$ gcc --version
> gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
> Copyright (C) 2018 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> {noformat}
> fails the build due to two warnings (even though --disable-werror was passed):
> {noformat}
> make[4]: Entering directory '/home/bevers/mesos/build/3rdparty/grpc-1.10.0'
> [C]   Compiling third_party/cares/cares/ares_init.c
> third_party/cares/cares/ares_init.c: In function ‘ares_dup’:
> third_party/cares/cares/ares_init.c:301:17: error: argument to ‘sizeof’ in 
> ‘strncpy’ call is the same expression as the source; did you mean to use the 
> size of the destination? [-Werror=sizeof-pointer-memaccess]
>sizeof(src->local_dev_name));
>  ^
> third_party/cares/cares/ares_init.c: At top level:
> cc1: error: unrecognized command line option ‘-Wno-invalid-source-encoding’ 
> [-Werror]
> cc1: all warnings being treated as errors
> make[4]: *** [Makefile:2635: 
> /home/bevers/mesos/build/3rdparty/grpc-1.10.0/objs/opt/third_party/cares/cares/ares_init.o]
>  Error 1
> {noformat}
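
For reference, the {{-Wsizeof-pointer-memaccess}} diagnostic above fires when the length given to {{strncpy}} is computed from the source expression rather than the destination. A minimal sketch of the destination-sized pattern follows. It assumes a hypothetical {{Channel}} struct standing in for the c-ares channel and a hypothetical helper {{copyDeviceName}}; it is not the upstream patch.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the c-ares channel structure; only the
// field named in the compiler diagnostic is modelled here.
struct Channel {
  char local_dev_name[32];
};

// Copies the device name, bounding the copy by the size of the
// destination buffer and always NUL-terminating the result.
void copyDeviceName(Channel* dest, const Channel* src) {
  assert(dest != nullptr && src != nullptr);

  std::strncpy(
      dest->local_dev_name,
      src->local_dev_name,
      sizeof(dest->local_dev_name) - 1);

  dest->local_dev_name[sizeof(dest->local_dev_name) - 1] = '\0';
}
```

Sizing the copy from the destination keeps the bound correct even if the two buffers ever differ in size.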





[jira] [Assigned] (MESOS-9319) Create all container devices at isolation time.

2018-10-15 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9319:
--

Assignee: James Peach

> Create all container devices at isolation time.
> ---
>
> Key: MESOS-9319
> URL: https://issues.apache.org/jira/browse/MESOS-9319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> When using a custom user namespace isolator, the task fails at launch because 
> opening devices fails with an EPERM error. This problem is described in [this 
> systemd issue|https://github.com/systemd/systemd/pull/9483] and [this 
> lxd issue|https://github.com/lxc/lxd/issues/4950].
> The problem arises in the Mesos containerizer due to the order of operations:
> # Clone the containerizer with {{CLONE_NEWNS}}
> # Mount a tmpfs for the devices
> # mknod for the various device nodes
> Referring back to the lxc issue, because we do (1) before (2), the tmpfs on 
> {{/dev}} is marked {{SB_I_NODEV}}. Due to the new 4.18 behavior, the mknod in 
> (3) now succeeds (see commit 
> [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
>  Previously it would fail and we would fall back to bind mounting the device. 
> However, even though we created the device, we can't actually open it due to 
> the {{SB_I_NODEV}} flag on the tmpfs mount. It appears that the purpose of 
> allowing mknod is so that containers can create overlayfs whiteouts.
> One approach to deal with this in the Mesos containerizer is to complete the 
> device node cleanup that was begun with the linux/devices isolator. This 
> approach involves moving all the responsibility for creating devices back to 
> the isolators. Then, at containerization time, we simply bind-mount the whole 
> of /dev from the per-container staging area. Since the isolators create the 
> devices in the host namespace and on the Mesos work directory, none of the 
> conditions that trigger the failure would be invoked.
> The failure we observed with our tasks was a failure to open {{/dev/null}}, 
> when redirecting it as standard input to a child process.
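
The failure mode described above (mknod succeeds but the node cannot be opened) suggests checking a freshly created device node by opening it before relying on it. A minimal sketch, assuming a hypothetical helper named {{isUsableCharDevice}}; this is not existing Mesos code.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: returns true only if `path` can be opened and
// refers to a character device. On a tmpfs marked SB_I_NODEV the open
// is expected to fail even though mknod succeeded.
bool isUsableCharDevice(const char* path) {
  assert(path != nullptr);

  int fd = ::open(path, O_RDONLY | O_CLOEXEC);
  if (fd < 0) {
    return false; // The open was refused, or the node is missing.
  }

  struct stat s;
  bool ok = ::fstat(fd, &s) == 0 && S_ISCHR(s.st_mode);

  ::close(fd);
  return ok;
}
```

A caller could fall back to bind mounting the host device whenever this check fails, which matches the pre-4.18 fallback path.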





[jira] [Issue Comment Deleted] (MESOS-9319) Create all container devices at isolation time

2018-10-15 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-9319:
---
Comment: was deleted

(was: When using a custom user namespace isolator, the task fails at launch 
because opening devices fails with an {{EPERM}} error. This problem is described 
in [this systemd issue|https://github.com/systemd/systemd/pull/9483] and this 
[lxd issue|https://github.com/lxc/lxd/issues/4950].

The problem arises in the Mesos containerizer due to the order of operations:

# Clone the containerizer with CLONE_NEWNS
# Mount a tmpfs for the devices
# mknod for the various device nodes

Referring back to the lxc issue, because we do (1) before (2), the tmpfs on 
/dev is marked SB_I_NODEV. Due to the new 4.18 behavior, the mknod in (3) now 
succeeds (see commit 
[55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
 Previously it would fail and we would fall back to bind mounting the device. 
However, even though we created the device, we can't actually open it due to 
the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of allowing 
mknod is so that containers can create overlayfs whiteouts.

One approach to deal with this in the Mesos containerizer is to complete the 
device node cleanup that was begun with the linux/devices isolator. This 
approach involves moving all the responsibility for creating devices back to 
the isolators. Then, at containerization time, we simply bind-mount the whole 
of /dev from the per-container staging area. Since the isolators create the 
devices in the host namespace and on the Mesos work directory, none of the 
conditions that trigger the failure would be invoked.

The failure we observed with our tasks was a failure to open {{/dev/null}}, 
when redirecting it as standard input to a child process.)

> Create all container devices at isolation time
> --
>
> Key: MESOS-9319
> URL: https://issues.apache.org/jira/browse/MESOS-9319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>
> When using a custom user namespace isolator, the task fails at launch because 
> opening devices fails with an EPERM error. This problem is described in this 
> systemd issue and this lxd issue.
> The problem arises in the Mesos containerizer due to the order of operations:
> 1. Clone the containerizer with CLONE_NEWNS
> 2. Mount a tmpfs for the devices
> 3. mknod for the various device nodes
> Referring back to the lxc issue, because we do (1) before (2), the tmpfs on 
> /dev is marked SB_I_NODEV. Due to the new 4.18 behavior, the mknod in (3) now 
> succeeds (see commit 55956b59df33). Previously it would fail and we would 
> fall back to bind mounting the device. However, even though we created the 
> device, we can't actually open it due to the SB_I_NODEV flag on the tmpfs 
> mount. It appears that the purpose of allowing mknod is so that containers 
> can create overlayfs whiteouts.
> One approach to deal with this in the Mesos containerizer is to complete the 
> device node cleanup that was begun with the linux/devices isolator. This 
> approach involves moving all the responsibility for creating devices back to 
> the isolators. Then, at containerization time, we simply bind-mount the whole 
> of /dev from the per-container staging area. Since the isolators create the 
> devices in the host namespace and on the Mesos work directory, none of the 
> conditions that trigger the failure would be invoked.
> The failure we observed with our tasks was a failure to open /dev/null, when 
> redirecting it as standard input to a child process.





[jira] [Comment Edited] (MESOS-9319) Create all container devices at isolation time

2018-10-15 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650806#comment-16650806
 ] 

James Peach edited comment on MESOS-9319 at 10/15/18 9:18 PM:
--

When using a custom user namespace isolator, the task fails at launch because 
opening devices fails with an {{EPERM}} error. This problem is described in 
[this systemd issue|https://github.com/systemd/systemd/pull/9483] and this [lxd 
issue|https://github.com/lxc/lxd/issues/4950].

The problem arises in the Mesos containerizer due to the order of operations:

# Clone the containerizer with CLONE_NEWNS
# Mount a tmpfs for the devices
# mknod for the various device nodes

Referring back to the lxc issue, because we do (1) before (2), the tmpfs on 
/dev is marked SB_I_NODEV. Due to the new 4.18 behavior, the mknod in (3) now 
succeeds (see commit 
[55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
 Previously it would fail and we would fall back to bind mounting the device. 
However, even though we created the device, we can't actually open it due to 
the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of allowing 
mknod is so that containers can create overlayfs whiteouts.

One approach to deal with this in the Mesos containerizer is to complete the 
device node cleanup that was begun with the linux/devices isolator. This 
approach involves moving all the responsibility for creating devices back to 
the isolators. Then, at containerization time, we simply bind-mount the whole 
of /dev from the per-container staging area. Since the isolators create the 
devices in the host namespace and on the Mesos work directory, none of the 
conditions that trigger the failure would be invoked.

The failure we observed with our tasks was a failure to open {{/dev/null}}, 
when redirecting it as standard input to a child process.


was (Author: jamespeach):
When using a custom user namespace isolator, the task fails at launch because 
opening devices fails with an {{EPERM}} error. This problem is described in 
[this systemd issue|https://github.com/systemd/systemd/pull/9483] and this [lxd 
issue|https://github.com/lxc/lxd/issues/4950].

The problem arises in the Mesos containerizer due to the order of operations:

# Clone the containerizer with CLONE_NEWNS
# Mount a tmpfs for the devices
# mknod for the various device nodes

Referring back to the lxc issue, because we do (1) before (2), the tmpfs on 
/dev is marked SB_I_NODEV. Due to the new 4.18 behavior, the mknod in (3) now 
succeeds (see commit 
[55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
 Previously it would fail and we would fall back to bind mounting the device. 
However, even though we created the device, we can't actually open it due to 
the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of allowing 
mknod is so that containers can create overlayfs whiteouts.

One approach to deal with this in the Mesos containerizer is to complete the 
device node cleanup that was begun with the linux/devices isolator. This 
approach involves moving all the responsibility for creating devices back to 
the isolators. Then, at containerization time, we simply bind-mount the whole 
of /dev from the per-container staging area. Since the isolators create the 
devices in the host namespace and on the Mesos work directory, none of the 
conditions that trigger the failure would be invoked.


> Create all container devices at isolation time
> --
>
> Key: MESOS-9319
> URL: https://issues.apache.org/jira/browse/MESOS-9319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>






[jira] [Commented] (MESOS-9319) Create all container devices at isolation time

2018-10-15 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650806#comment-16650806
 ] 

James Peach commented on MESOS-9319:


When using a custom user namespace isolator, the task fails at launch because 
opening devices fails with an {{EPERM}} error. This problem is described in 
[this systemd issue|https://github.com/systemd/systemd/pull/9483] and this [lxd 
issue|https://github.com/lxc/lxd/issues/4950].

The problem arises in the Mesos containerizer due to the order of operations:

# Clone the containerizer with CLONE_NEWNS
# Mount a tmpfs for the devices
# mknod for the various device nodes

Referring back to the lxc issue, because we do (1) before (2), the tmpfs on 
/dev is marked SB_I_NODEV. Due to the new 4.18 behavior, the mknod in (3) now 
succeeds (see commit 
[55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
 Previously it would fail and we would fall back to bind mounting the device. 
However, even though we created the device, we can't actually open it due to 
the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of allowing 
mknod is so that containers can create overlayfs whiteouts.

One approach to deal with this in the Mesos containerizer is to complete the 
device node cleanup that was begun with the linux/devices isolator. This 
approach involves moving all the responsibility for creating devices back to 
the isolators. Then, at containerization time, we simply bind-mount the whole 
of /dev from the per-container staging area. Since the isolators create the 
devices in the host namespace and on the Mesos work directory, none of the 
conditions that trigger the failure would be invoked.


> Create all container devices at isolation time
> --
>
> Key: MESOS-9319
> URL: https://issues.apache.org/jira/browse/MESOS-9319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: When using a custom user namespace isolator, the task 
> fails at launch because opening devices fails with an {{EPERM}} error. This 
> problem is described in [this systemd 
> issue|https://github.com/systemd/systemd/pull/9483] and this [lxd 
> issue|https://github.com/lxc/lxd/issues/4950].
> The problem arises in the Mesos containerizer due to the order of operations:
> # Clone the containerizer with CLONE_NEWNS
> # Mount a tmpfs for the devices
> # mknod for the various device nodes
> Referring back to the lxc issue, because we do (1) before (2), the tmpfs on 
> /dev is marked SB_I_NODEV. Due to the new 4.18 behavior, the mknod in (3) now 
> succeeds (see commit 
> [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
>  Previously it would fail and we would fall back to bind mounting the device. 
> However, even though we created the device, we can't actually open it due to 
> the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of 
> allowing mknod is so that containers can create overlayfs whiteouts.
> One approach to deal with this in the Mesos containerizer is to complete the 
> device node cleanup that was begun with the linux/devices isolator. This 
> approach involves moving all the responsibility for creating devices back to 
> the isolators. Then, at containerization time, we simply bind-mount the whole 
> of /dev from the per-container staging area. Since the isolators create the 
> devices in the host namespace and on the Mesos work directory, none of the 
> conditions that trigger the failure would be invoked.
>Reporter: James Peach
>Priority: Major
>






[jira] [Created] (MESOS-9319) Create all container devices at isolation time

2018-10-15 Thread James Peach (JIRA)
James Peach created MESOS-9319:
--

 Summary: Create all container devices at isolation time
 Key: MESOS-9319
 URL: https://issues.apache.org/jira/browse/MESOS-9319
 Project: Mesos
  Issue Type: Bug
  Components: containerization
 Environment: When using a custom user namespace isolator, the task 
fails at launch because opening devices fails with an {{EPERM}} error. This 
problem is described in [this systemd 
issue|https://github.com/systemd/systemd/pull/9483] and this [lxd 
issue|https://github.com/lxc/lxd/issues/4950].

The problem arises in the Mesos containerizer due to the order of operations:

# Clone the containerizer with CLONE_NEWNS
# Mount a tmpfs for the devices
# mknod for the various device nodes

Referring back to the lxc issue, because we do (1) before (2), the tmpfs on 
/dev is marked SB_I_NODEV. Due to the new 4.18 behavior, the mknod in (3) now 
succeeds (see commit 
[55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
 Previously it would fail and we would fall back to bind mounting the device. 
However, even though we created the device, we can't actually open it due to 
the SB_I_NODEV flag on the tmpfs mount. It appears that the purpose of allowing 
mknod is so that containers can create overlayfs whiteouts.

One approach to deal with this in the Mesos containerizer is to complete the 
device node cleanup that was begun with the linux/devices isolator. This 
approach involves moving all the responsibility for creating devices back to 
the isolators. Then, at containerization time, we simply bind-mount the whole 
of /dev from the per-container staging area. Since the isolators create the 
devices in the host namespace and on the Mesos work directory, none of the 
conditions that trigger the failure would be invoked.

Reporter: James Peach








[jira] [Comment Edited] (MESOS-8313) Provide a host namespace container supervisor.

2018-10-15 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358625#comment-16358625
 ] 

James Peach edited comment on MESOS-8313 at 10/15/18 6:38 PM:
--

Note, this supervisor needs to reap all its children, as per MESOS-5893.


was (Author: jamespeach):
Note, this supervisor need to read all its children, as per MESOS-5893.

> Provide a host namespace container supervisor.
> --
>
> Key: MESOS-8313
> URL: https://issues.apache.org/jira/browse/MESOS-8313
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
> Attachments: IMG_2629.JPG
>
>
> After more investigation on user namespaces, the current implementation of 
> creating the container namespaces needs some adjustment before we can 
> implement user namespaces in a useable fashion.
> The problems we need to address are:
> 1. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the PID namespace 
> to mount {{procfs}}. Currently, this prevents containers joining the host PID 
> namespace. The workaround is to always create a new container PID namespace 
> (as a child of the user namespace) with the {{namespaces/pid}} isolator.
> 2. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the network 
> namespace to mount {{sysfs}}. There's no general workaround for this since we 
> can't generally require containers to not join the host network namespace.
> 3. The containerizer can't enter a user namespace after entering the 
> {{chroot}}. This restriction makes it impossible to retain the existing order 
> of containerizer operations in the case where we want the executor to be 
> in a new user namespace that has no children (i.e. to protect the container 
> from a privileged task).
> After some discussion with [~jieyu], we believe that we can solve most or all 
> of these issues by creating a new containerizer supervisor that runs fully 
> outside the container and is responsible for constructing the rootfs mount 
> namespace, launching the containerizer to enter the rest of the container, 
> and waiting on the entered process.
> Since this new supervisor process is not running in the user namespace, it 
> will be able to construct the container rootfs in a new mount namespace 
> without user namespace restrictions. We can then clone a child to fully 
> create and enter container namespaces along with the prefabricated rootfs 
> mount namespace.
> The only drawback to this approach is that the container's mount namespace 
> will be owned by the root user namespace rather than the container user 
> namespace. We are OK with this for now.
> The plan here is to retain the existing {{mesos-containerizer launch}} 
> subcommand and add a new {{mesos-containerizer supervise}} subcommand, which 
> will be its parent process. This new subcommand will be used for the default 
> executor and custom executor code paths.
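
Since the supervisor waits on the entered process and has to reap all of its children, its wait loop could look roughly like the following. This is a minimal sketch built on plain {{waitpid}}; the function name {{reapAllChildren}} is hypothetical and nothing here is taken from the eventual {{mesos-containerizer supervise}} implementation.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: blocks until every child of the calling process
// has exited and been reaped. Returns the number of children reaped.
int reapAllChildren() {
  int reaped = 0;

  for (;;) {
    int status = 0;
    pid_t pid = ::waitpid(-1, &status, 0);

    if (pid > 0) {
      ++reaped; // One child (direct or re-parented) was collected.
      continue;
    }

    if (errno == EINTR) {
      continue; // Interrupted by a signal; retry the wait.
    }

    // With these arguments the only remaining error is "no children".
    assert(errno == ECHILD);
    break;
  }

  return reaped;
}
```

A real supervisor would also record the exit status of the main child so that it can be propagated to the agent.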





[jira] [Commented] (MESOS-9300) XFS isolator can mislabel project IDs on persistence volumes.

2018-10-08 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642305#comment-16642305
 ] 

James Peach commented on MESOS-9300:


MacOS has 
[ATTR_DIR_MOUNTSTATUS|https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/getattrlist.2.html#//apple_ref/doc/man/2/getattrlist],
 but AFAIK there's not a straight-forward equivalent on Linux.

However, it looks like we can detect this on Linux with the [EXDEV rename 
trick|http://blog.schmorp.de/2016-03-03-detecting-a-mount-point.html].
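
For context, the usual portable check compares the device number of a directory with that of its parent. The sketch below uses a hypothetical helper named {{isDeviceBoundary}} and is shown only to illustrate the gap: a bind mount whose source lives on the same filesystem reports the same {{st_dev}} as its parent, so this check does not detect it.

```cpp
#include <cassert>
#include <string>
#include <sys/stat.h>

// Hypothetical helper: returns true if `dir` is on a different device
// than its parent directory, or is the filesystem root. It does NOT
// detect a bind mount whose source is on the same filesystem.
bool isDeviceBoundary(const std::string& dir) {
  assert(!dir.empty());

  struct stat self;
  struct stat parent;

  if (::lstat(dir.c_str(), &self) != 0 ||
      ::lstat((dir + "/..").c_str(), &parent) != 0) {
    return false;
  }

  if (!S_ISDIR(self.st_mode)) {
    return false;
  }

  // Different device: a mount of another filesystem.
  // Same inode as the parent: the root directory.
  return self.st_dev != parent.st_dev || self.st_ino == parent.st_ino;
}
```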

> XFS isolator can mislabel project IDs on persistence volumes.
> -
>
> Key: MESOS-9300
> URL: https://issues.apache.org/jira/browse/MESOS-9300
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> What happens here is that we are erroneously applying the sandbox's project 
> ID to the persistent volume.
> First, the filesystem/linux isolator bind mounts the persistent volume into 
> the sandbox:
> {noformat}
> I1003 06:49:21.907644 2812466 linux.cpp:593] Mounting 
> '/srv/mesos/work/volumes/roles/pie.mobius/21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f'
>  to 
> '/srv/mesos/work/slaves/909cff92-8e17-41bf-a251-9b5eb6186c35-S0/frameworks/363e6d80-8c38-46cf-815f-2fbf60a62628-0309/executors/mobius-mloop-1538549013_438156792-v2-shared-volume.pod1.writer-job.0.e93hs3uips2i9_1/runs/9e5770a7-9f78-46dc-9264-3e80be0e40cc/shared'
>  for persistent volume disk(allocated: pie.mobius)(reservations: 
> [(DYNAMIC,pie.mobius,jarvis-principal,\{podInstance: e93hs3uips2i9, pod: 
> pod1, service: 
> mobius-mloop-1538549013_438156792-v2-shared-volume})])[21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f:shared]:1
>  of container 9e5770a7-9f78-46dc-9264-3e80be0e40cc
> {noformat}
> Next, the `disk/xfs` isolator assigns a project ID to the sandbox:
> {noformat}
> I1003 06:49:21.920197 2812452 disk.cpp:402] Assigned project 6806 to 
> '/srv/mesos/work/slaves/909cff92-8e17-41bf-a251-9b5eb6186c35-S0/frameworks/363e6d80-8c38-46cf-815f-2fbf60a62628-0309/executors/mobius-mloop-1538549013_438156792-v2-shared-volume.pod1.writer-job.0.e93hs3uips2i9_1/runs/9e5770a7-9f78-46dc-9264-3e80be0e40cc'
> {noformat}
> Note that when this happens, the isolator recursively applies the project ID 
> to the contents of the sandbox. It doesn't follow symlinks or cross devices 
> when it does this, but on Linux, a bind mount would not trigger either of 
> these conditions.
> Finally, the `disk/xfs` isolator tries to assign a project ID to the 
> persistent volume as it is used by the task:
> {noformat}
> F1003 06:49:21.920577 2812452 disk.cpp:532] Check failed: 
> scheduledProjects.contains(projectId.get()) untracked project ID 6806 for 
> volume ID 21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f on 
> /srv/mesos/work/volumes/roles/pie.mobius/21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f
> {noformat}
> This check fails because, if the persistent volume has a project ID, we 
> expect that it has already been scheduled for reclamation. However, its 
> project ID is the one we assigned to the sandbox. We don't schedule the 
> sandbox for reclamation until cleanup, so (fortunately) the invariant check 
> triggers.
> So, apart from triggering the CHECK, the root cause of this is that we are 
> altering the project ID of the persistent volume, which permanently 
> misattributes the corresponding quota.





[jira] [Created] (MESOS-9300) XFS isolator can mislabel project IDs on persistence volumes.

2018-10-08 Thread James Peach (JIRA)
James Peach created MESOS-9300:
--

 Summary: XFS isolator can mislabel project IDs on persistence 
volumes.
 Key: MESOS-9300
 URL: https://issues.apache.org/jira/browse/MESOS-9300
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: James Peach
Assignee: James Peach


What happens here is that we are erroneously applying the sandbox's project ID 
to the persistent volume.

First, the filesystem/linux isolator bind mounts the persistent volume into the 
sandbox:

{noformat}
I1003 06:49:21.907644 2812466 linux.cpp:593] Mounting 
'/srv/mesos/work/volumes/roles/pie.mobius/21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f' 
to 
'/srv/mesos/work/slaves/909cff92-8e17-41bf-a251-9b5eb6186c35-S0/frameworks/363e6d80-8c38-46cf-815f-2fbf60a62628-0309/executors/mobius-mloop-1538549013_438156792-v2-shared-volume.pod1.writer-job.0.e93hs3uips2i9_1/runs/9e5770a7-9f78-46dc-9264-3e80be0e40cc/shared'
 for persistent volume disk(allocated: pie.mobius)(reservations: 
[(DYNAMIC,pie.mobius,jarvis-principal,\{podInstance: e93hs3uips2i9, pod: pod1, 
service: 
mobius-mloop-1538549013_438156792-v2-shared-volume})])[21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f:shared]:1
 of container 9e5770a7-9f78-46dc-9264-3e80be0e40cc
{noformat}

Next, the `disk/xfs` isolator assigns a project ID to the sandbox:

{noformat}
I1003 06:49:21.920197 2812452 disk.cpp:402] Assigned project 6806 to 
'/srv/mesos/work/slaves/909cff92-8e17-41bf-a251-9b5eb6186c35-S0/frameworks/363e6d80-8c38-46cf-815f-2fbf60a62628-0309/executors/mobius-mloop-1538549013_438156792-v2-shared-volume.pod1.writer-job.0.e93hs3uips2i9_1/runs/9e5770a7-9f78-46dc-9264-3e80be0e40cc'
{noformat}

Note that when this happens, the isolator recursively applies the project ID 
to the contents of the sandbox. It doesn't follow symlinks or cross devices 
when it does this, but on Linux, a bind mount would not trigger either of these 
conditions.

Finally, the `disk/xfs` isolator tries to assign a project ID to the persistent 
volume as it is used by the task:

{noformat}
F1003 06:49:21.920577 2812452 disk.cpp:532] Check failed: 
scheduledProjects.contains(projectId.get()) untracked project ID 6806 for 
volume ID 21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f on 
/srv/mesos/work/volumes/roles/pie.mobius/21cb2eb6-b3e5-46f2-944e-8f6e5db9f07f
{noformat}

This check fails because, if the persistent volume has a project ID, we expect 
that it has already been scheduled for reclamation. However, its project ID is 
the one we assigned to the sandbox. We don't schedule the sandbox for 
reclamation until cleanup, so (fortunately) the invariant check triggers.

So, apart from triggering the CHECK, the root cause of this is that we are 
altering the project ID of the persistent volume, which permanently 
misattributes the corresponding quota.
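
To illustrate why the recursive labelling reaches the persistent volume, here is a minimal sketch of a tree walk that neither follows symlinks nor crosses devices, written with POSIX {{nftw}}. Whether the isolator uses this exact API is an assumption, and {{countEntries}} is a hypothetical name. A bind mount of a directory from the same XFS filesystem reports the same device number as the sandbox, so a walk like this descends into it.

```cpp
#include <cassert>
#include <ftw.h>
#include <sys/stat.h>

// Number of entries seen by the most recent walk. A file-scope counter
// keeps the sketch short; it is not thread-safe.
static int visited = 0;

// nftw callback: a real isolator would apply the project ID here.
static int visit(const char*, const struct stat*, int, struct FTW*) {
  ++visited;
  return 0; // Keep walking.
}

// Hypothetical helper: walks `root` without following symlinks
// (FTW_PHYS) and without leaving the starting device (FTW_MOUNT).
// Returns the number of entries visited, or -1 on error.
int countEntries(const char* root) {
  assert(root != nullptr);

  visited = 0;
  if (::nftw(root, visit, 16, FTW_PHYS | FTW_MOUNT) != 0) {
    return -1;
  }

  return visited;
}
```

Skipping the bind mount would need an explicit mount point test on each directory, since the device number alone cannot tell the two apart.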





[jira] [Commented] (MESOS-895) Unbundle libev.

2018-09-21 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624201#comment-16624201
 ] 

James Peach commented on MESOS-895:
---

{noformat}
commit 0b9861e356ec2d7d50163ae54a6be9c1c45f279b
Author: James Peach 
Date:   Fri Sep 21 14:13:29 2018 -0700

Removed bundled libev patch.

Since we now disable the libev SIGCHLD handler at runtime, we no longer
need to bundle the patch to do it at build time. It is still useful to
bundle libev itself, to support older distributions.

Review: https://reviews.apache.org/r/68800/
{noformat}

> Unbundle libev.
> ---
>
> Key: MESOS-895
> URL: https://issues.apache.org/jira/browse/MESOS-895
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: Timothy St. Clair
>Assignee: James Peach
>Priority: Major
>  Labels: tech-debt
>
> The libev patch can easily be removed by updating the configuration flags and 
> possibly the accompanying code prior to the include.   
> For configure pass in: 
> CFLAGS=-DEV_CHILD_ENABLE=0
> For inclusion: 
> #define EV_CHILD_ENABLE 0
> include 
> excerpt from maintainer: 
>  that patch is unnecessary
>  schmorp, so if they wanted to just set EV_CHILD_ENABLE=0 they 
> could just pass CFLAGS=-DEV_CHILD_ENABLE=0  through.
>  tstclair: yes, or use a wrapper





[jira] [Comment Edited] (MESOS-895) Unbundle libev.

2018-09-21 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624201#comment-16624201
 ] 

James Peach edited comment on MESOS-895 at 9/21/18 9:21 PM:


{noformat}
commit 0b9861e356ec2d7d50163ae54a6be9c1c45f279b
Author: James Peach 
Date:   Fri Sep 21 14:13:29 2018 -0700

Removed bundled libev patch.

Since we now disable the libev SIGCHLD handler at runtime, we no longer
need to bundle the patch to do it at build time. It is still useful to
bundle libev itself, to support older distributions.

Review: https://reviews.apache.org/r/68800/
{noformat}


was (Author: jamespeach):
{noformat}
commit 0b9861e356ec2d7d50163ae54a6be9c1c45f279b
Author: James Peach 
Date:   Fri Sep 21 14:13:29 2018 -0700

Removed bundled libev patch.

Since we now disable the libev SIGCHLD handler at runtime, we no longer
need to bundle the patch to do it at build time. It is still useful to
bundle libev itself, to support older distributions.

Review: https://reviews.apache.org/r/68800/
{noformat}

> Unbundle libev.
> ---
>
> Key: MESOS-895
> URL: https://issues.apache.org/jira/browse/MESOS-895
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: Timothy St. Clair
>Assignee: James Peach
>Priority: Major
>  Labels: tech-debt
>
> The libev patch can easily be removed by updating the configuration flags and 
> possibly the accompanying code prior to the include.   
> For configure pass in: 
> CFLAGS=-DEV_CHILD_ENABLE=0
> For inclusion: 
> #define EV_CHILD_ENABLE 0
> include 
> excerpt from maintainer: 
>  that patch is unnecessary
>  schmorp, so if they wanted to just set EV_CHILD_ENABLE=0 they 
> could just pass CFLAGS=-DEV_CHILD_ENABLE=0  through.
>  tstclair: yes, or use a wrapper





[jira] [Commented] (MESOS-9246) Verify libarchive version at configuration time.

2018-09-20 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622567#comment-16622567
 ] 

James Peach commented on MESOS-9246:


/cc [~andschwa]

> Verify libarchive version at configuration time.
> 
>
> Key: MESOS-9246
> URL: https://issues.apache.org/jira/browse/MESOS-9246
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>Priority: Major
>
> The Mesos build system doesn't verify that {{libarchive}} is a new enough 
> version to provide all the APIs that Mesos needs. For example, on CentOS 6 
> with {{libarchive}} 2.8.3, the build will fail:
> {noformat}
> ../../3rdparty/stout/include/stout/archiver.hpp: In function 'Try 
> archiver::extract(const string&, const string&, int)':
> ../../3rdparty/stout/include/stout/archiver.hpp:55:47: error: 
> 'archive_read_support_filter_all' was not declared in this scope
>archive_read_support_filter_all(reader.get());
>^
> ../../3rdparty/stout/include/stout/archiver.hpp: In lambda function:
> ../../3rdparty/stout/include/stout/archiver.hpp:61:27: error: 
> 'archive_write_free' was not declared in this scope
>archive_write_free(p);
>^
> ../../3rdparty/stout/include/stout/archiver.hpp: In function 'Try 
> archiver::extract(const string&, const string&, int)':
> ../../3rdparty/stout/include/stout/archiver.hpp:120:70: error: 
> 'archive_entry_hardlink_utf8' was not declared in this scope
>const char* hardlink_target = archive_entry_hardlink_utf8(entry);
>   ^
> ../../3rdparty/stout/include/stout/archiver.hpp:130:68: error: 
> 'archive_entry_pathname_utf8' was not declared in this scope
>path::join(destination, 
> archive_entry_pathname_utf8(entry)).c_str());
> {noformat}
> We should verify that new APIs we need are present at configuration time.





[jira] [Created] (MESOS-9246) Verify libarchive version at configuration time.

2018-09-20 Thread James Peach (JIRA)
James Peach created MESOS-9246:
--

 Summary: Verify libarchive version at configuration time.
 Key: MESOS-9246
 URL: https://issues.apache.org/jira/browse/MESOS-9246
 Project: Mesos
  Issue Type: Bug
Reporter: James Peach


The Mesos build system doesn't verify that {{libarchive}} is a new enough 
version to provide all the APIs that Mesos needs. For example, on CentOS 6 with 
{{libarchive}} 2.8.3, the build will fail:

{noformat}
../../3rdparty/stout/include/stout/archiver.hpp: In function 'Try 
archiver::extract(const string&, const string&, int)':
../../3rdparty/stout/include/stout/archiver.hpp:55:47: error: 
'archive_read_support_filter_all' was not declared in this scope
   archive_read_support_filter_all(reader.get());
   ^
../../3rdparty/stout/include/stout/archiver.hpp: In lambda function:
../../3rdparty/stout/include/stout/archiver.hpp:61:27: error: 
'archive_write_free' was not declared in this scope
   archive_write_free(p);
   ^
../../3rdparty/stout/include/stout/archiver.hpp: In function 'Try 
archiver::extract(const string&, const string&, int)':
../../3rdparty/stout/include/stout/archiver.hpp:120:70: error: 
'archive_entry_hardlink_utf8' was not declared in this scope
   const char* hardlink_target = archive_entry_hardlink_utf8(entry);
  ^
../../3rdparty/stout/include/stout/archiver.hpp:130:68: error: 
'archive_entry_pathname_utf8' was not declared in this scope
   path::join(destination, archive_entry_pathname_utf8(entry)).c_str());
{noformat}

We should verify that new APIs we need are present at configuration time.
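
One possible shape for such a check is a plain version comparison. The sketch below is illustrative only: it assumes libarchive's documented encoding of ARCHIVE_VERSION_NUMBER (major * 1000000 + minor * 1000 + revision), and the 3.0.0 minimum is a hypothetical threshold that is not stated in this issue. The helper names are made up for the example.

```c
#include <assert.h>

/* Encode a.b.c the same way libarchive encodes ARCHIVE_VERSION_NUMBER. */
long encode_archive_version(int major, int minor, int revision)
{
  assert(major >= 0);
  assert(minor >= 0 && minor < 1000);
  assert(revision >= 0 && revision < 1000);
  return major * 1000000L + minor * 1000L + revision;
}

/* Hypothetical minimum (3.0.0); the real threshold would be derived from
   the libarchive APIs that Mesos actually calls. */
#define MIN_ARCHIVE_VERSION 3000000L

/* Returns 1 if the encoded version meets the assumed minimum, else 0. */
int archive_version_is_new_enough(long version_number)
{
  return version_number >= MIN_ARCHIVE_VERSION;
}
```

In a real build the comparison would be made against ARCHIVE_VERSION_NUMBER from <archive.h>, or replaced by per-function checks for the symbols named in the errors above.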





[jira] [Created] (MESOS-9240) CSI protobuf build fails when dependency tracking is disabled.

2018-09-17 Thread James Peach (JIRA)
James Peach created MESOS-9240:
--

 Summary: CSI protobuf build fails when dependency tracking is 
disabled.
 Key: MESOS-9240
 URL: https://issues.apache.org/jira/browse/MESOS-9240
 Project: Mesos
  Issue Type: Bug
  Components: build
Reporter: James Peach
Assignee: James Peach


Generating the CSI protobufs depends on the "$(builddir)/include/csi" directory 
being created at configuration time. However, this only happens when automatic 
build dependency tracking is enabled. By default, rpmbuild will pass 
{{\--disable-dependency-tracking}}, which prevents this directory from being 
created, and the build fails like so:
{noformat}
./../include/mesos/v1/master/master.proto
/usr/bin/protoc -I../../include -I../../src -I../3rdparty/csi-0.2.0 
--cpp_out=../include ../../include/mesos/v1/quota/quota.proto
/usr/bin/protoc -I../../include -I../../src -I../3rdparty/csi-0.2.0 
--cpp_out=../include 
../../include/mesos/v1/resource_provider/resource_provider.proto
../include/csi/: No such file or directory
/usr/bin/protoc -I../../include -I../../src -I../3rdparty/csi-0.2.0 
--cpp_out=../include ../../include/mesos/v1/scheduler/scheduler.proto
/usr/bin/protoc -I../../include -I../../src -I../3rdparty/csi-0.2.0 --cpp_out=. 
../../src/master/registry.proto
make[2]: *** [../include/csi/csi.grpc.pb.cc] Error 1
make[2]: *** Waiting for unfinished jobs
../include/csi/: No such file or directory
make[2]: *** [../include/csi/csi.pb.cc] Error 1
{noformat}





[jira] [Assigned] (MESOS-895) Unbundle libev.

2018-09-11 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-895:
-

Assignee: James Peach  (was: Timothy St. Clair)

CentOS 6 ships {{libev}} 4.03 and Ubuntu 14.04 ships 4.15, so once 
MESOS-9212 lands, I think we can unbundle {{libev}}.

/cc [~tillt] [~bmahler] [~vinodkone]

> Unbundle libev.
> ---
>
> Key: MESOS-895
> URL: https://issues.apache.org/jira/browse/MESOS-895
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: Timothy St. Clair
>Assignee: James Peach
>Priority: Major
>  Labels: tech-debt
>
> The libev patch can easily be removed by updating the configuration flags and, 
> possibly, the accompanying code prior to the include.
> For configure pass in: 
> CFLAGS=-DEV_CHILD_ENABLE=0
> For inclusion: 
> #define EV_CHILD_ENABLE 0
> include 
> excerpt from maintainer: 
>  that patch is unnecessary
>  schmorp, so if they wanted to just set EV_CHILD_ENABLE=0 they 
> could just pass CFLAGS=-DEV_CHILD_ENABLE=0  through.
>  tstclair: yes, or use a wrapper





[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-09-11 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610795#comment-16610795
 ] 

James Peach commented on MESOS-9178:


Another way to measure this is to publish it in the event stream.

> Add a metric for master failover time.
> --
>
> Key: MESOS-9178
> URL: https://issues.apache.org/jira/browse/MESOS-9178
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Xudong Ni
>Assignee: Xudong Ni
>Priority: Minor
>
> Quote from Yan Xu: Previously the argument against it was that you don't know if 
> all agents are going to come back after a master failover, so there's not a 
> certain point that marks the end of "full reregistration of all agents". 
> However, empirically the number of agents usually doesn't change during the 
> failover and there's an upper bound on such a wait (after a 10min timeout the 
> agents that haven't reregistered are going to be marked unreachable), so we can 
> just use that to stop the timer.
> So we can define failover time as "the time it takes for all agents recovered 
> from the registry to be accounted for" i.e., either reregistered or marked as 
> unreachable.
> This is of course looking at failover from an agent reregistration 
> perspective.
> Later after we add framework info persistence, we can similarly define the 
> framework perspective using reregistration time or reconciliation time.
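
The definition quoted above can be sketched as a small counter. This is a minimal illustration, assuming timestamps are plain seconds and that each recovered agent is reported exactly once when it either reregisters or is marked unreachable; the struct and function names are hypothetical and do not come from the Mesos master.

```c
#include <assert.h>
#include <stddef.h>

struct failover_timer {
  size_t recovered; /* Agents recovered from the registry. */
  size_t accounted; /* Agents reregistered or marked unreachable so far. */
  double started;   /* Time the failover began, in seconds. */
  double finished;  /* Time the last agent was accounted for; < 0 if pending. */
};

void failover_timer_init(struct failover_timer *t, size_t recovered, double now)
{
  t->recovered = recovered;
  t->accounted = 0;
  t->started = now;
  /* With no recovered agents there is nothing to wait for. */
  t->finished = (recovered == 0) ? now : -1.0;
}

/* Call once per agent, when it reregisters or is marked unreachable. */
void failover_timer_account(struct failover_timer *t, double now)
{
  assert(t->accounted < t->recovered);
  t->accounted++;
  if (t->accounted == t->recovered) {
    t->finished = now;
  }
}

/* Returns the failover time in seconds, or a negative value if some
   recovered agents are still unaccounted for. */
double failover_timer_elapsed(const struct failover_timer *t)
{
  return (t->finished < 0) ? -1.0 : t->finished - t->started;
}
```

Because unreachable agents also count, the timer is bounded by the reregistration timeout mentioned in the quote.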





[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-09-10 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609911#comment-16609911
 ] 

James Peach commented on MESOS-9178:


Say you have a time-series gauge at various percentages as per [~bmahler]'s 
suggestion. The gauge value would have to persist, so once it is set, it would 
remain at that value thereafter. If you needed to do analytics, you would need 
to carefully choose the first sample after a failover. For time-series, the 
easiest thing to do is to plot it, and it's not at all clear to me how you 
could do that and show a meaningful graph, because what you really want is to 
compare the historical failover times. I'm not that experienced with Grafana, 
and I don't know how I would do that.



[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-09-07 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16607548#comment-16607548
 ] 

James Peach commented on MESOS-9178:


Are we convinced that a metric is the right approach? This seems like something 
that you might want to compare over long time periods, which might be better 
suited to doing analytics on logs.



[jira] [Assigned] (MESOS-9212) Disable SIGCHLD handling in libev.

2018-09-06 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9212:
--

Assignee: James Peach

| [r/68660|https://reviews.apache.org/r/68660] | Disabled SIGCHLD handling in 
the libev event loop. |

> Disable SIGCHLD handling in libev.
> ---
>
> Key: MESOS-9212
> URL: https://issues.apache.org/jira/browse/MESOS-9212
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> On Fedora 28, building against the system version of libev (version 4.24) 
> causes the following tests to fail:
> {noformat}
> [  FAILED  ] ReapTest.NonChildProcess
> [  FAILED  ] ReapTest.ChildProcess
> [  FAILED  ] ReapTest.TerminatedChildProcess
> [  FAILED  ] SubprocessTest.PipeOutputToFileDescriptor
> [  FAILED  ] SubprocessTest.PipeOutputToPath
> [  FAILED  ] SubprocessTest.EnvironmentEcho
> [  FAILED  ] SubprocessTest.Status
> [  FAILED  ] SubprocessTest.PipeOutput
> [  FAILED  ] SubprocessTest.PipeLargeOutput
> [  FAILED  ] SubprocessTest.PipeInput
> [  FAILED  ] SubprocessTest.PipeRedirect
> [  FAILED  ] SubprocessTest.PathOutput
> [  FAILED  ] SubprocessTest.PathInput
> [  FAILED  ] SubprocessTest.FdOutput
> [  FAILED  ] SubprocessTest.FdInput
> [  FAILED  ] SubprocessTest.Default
> [  FAILED  ] SubprocessTest.Flags
> [  FAILED  ] SubprocessTest.Environment
> [  FAILED  ] SubprocessTest.EnvironmentWithSpaces
> [  FAILED  ] SubprocessTest.EnvironmentWithSpacesAndQuotes
> [  FAILED  ] SubprocessTest.EnvironmentOverride
> {noformat}
> This build configuration succeeds:
> {noformat}
> $ ../configure --disable-java --disable-python --enable-silent-rules 
> --disable-hardening --disable-werror --disable-libtool-wrappers 
> --enable-xfs-disk-isolator --enable-install-module-dependencies 
> --enable-port-mapping-isolator --enable-network-ports-isolator 
> --with-protobuf=/usr --with-curl=/usr --with-libarchive=/usr 
> --with-zookeeper=/usr --prefix=/opt/mesos "CXXFLAGS=-O0 -ggdb3 
> -fno-omit-frame-pointer -fvisibility-inlines-hidden 
> -Wno-unused-local-typedefs -Wno-deprecated" "CFLAGS=-O0 -ggdb3 
> -fno-omit-frame-pointer -Wno-unused-local-typedefs -Wno-deprecated" LDFLAGS= 
> CXX=/home/jpeach/src/asf-mesos/build/c++ 
> CC=/home/jpeach/src/asf-mesos/build/cc LD=/home/jpeach/src/asf-mesos/build/ld
> {noformat}
> This build configuration fails:
> {noformat}
>   $ ../configure --disable-java --disable-python --enable-silent-rules 
> --disable-hardening --disable-werror --disable-libtool-wrappers 
> --enable-xfs-disk-isolator --enable-install-module-dependencies 
> --enable-port-mapping-isolator --enable-network-ports-isolator 
> --with-protobuf=/usr --with-curl=/usr --with-libarchive=/usr 
> --with-zookeeper=/usr --prefix=/opt/mesos "CXXFLAGS=-O0 -ggdb3 
> -fno-omit-frame-pointer -fvisibility-inlines-hidden 
> -Wno-unused-local-typedefs -Wno-deprecated" "CFLAGS=-O0 -ggdb3 
> -fno-omit-frame-pointer -Wno-unused-local-typedefs -Wno-deprecated" LDFLAGS= 
> CXX=/home/jpeach/src/asf-mesos/build/c++ 
> CC=/home/jpeach/src/asf-mesos/build/cc LD=/home/jpeach/src/asf-mesos/build/ld 
> --with-libev=/usr
> {noformat}
> I think what happens here is that the child process gets reaped wrongly 
> somehow:
> {noformat}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from SubprocessTest
> [ RUN  ] SubprocessTest.EnvironmentWithSpaces
> [pid 25909] clone(child_stack=NULL, 
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
> child_tidptr=0x7fa11881fcd0) = 25923
> strace: Process 25923 attached
> [pid 25923] execve("/usr/bin/sh", ["sh", "-c", "echo $MESSAGE"], 0x1ff3950 /* 
> 1 var */) = 0
> [pid 25923] arch_prctl(ARCH_SET_FS, 0x7f24561c5740) = 0
> [pid 25923] exit_group(0)   = ?
> [pid 25923] +++ exited with 0 +++
> [pid 25909] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=25923, 
> si_uid=9306, si_status=0, si_utime=0, si_stime=0} ---
> [pid 25922] wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 
> WNOHANG|WSTOPPED|WCONTINUED, NULL) = 25923
> [pid 25922] wait4(-1, 0x7fa10a74da44, WNOHANG|WSTOPPED|WCONTINUED, NULL) = -1 
> ECHILD (No child processes)
> [pid 25919] wait4(25923, 0x7fa10bf50548, WNOHANG, NULL) = -1 ECHILD (No child 
> processes)
> ../../../3rdparty/libprocess/src/tests/subprocess_tests.cpp:977: Failure
> (s->status()).get() is NONE
> [  FAILED  ] SubprocessTest.EnvironmentWithSpaces (12 ms)
> [--] 1 test from SubprocessTest (12 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (12 ms total)
> [  PASSED  ] 0 tests.
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] SubprocessTest.EnvironmentWithSpaces
> {noformat}




[jira] [Commented] (MESOS-9212) Subprocess tests fail with libev 4.24

2018-09-06 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16606233#comment-16606233
 ] 

James Peach commented on MESOS-9212:


This might be due to the libev patch we are carrying?

{noformat}
[jpeach@jpeach 3rdparty]$ cat libev-4.22.patch
diff --git a/ev.h b/ev.h
index 38f62d8..0055cfd 100644
--- a/ev.h
+++ b/ev.h
@@ -125,7 +125,7 @@ EV_CPP(extern "C" {)
 # ifdef _WIN32
 #  define EV_CHILD_ENABLE 0
 # else
-#  define EV_CHILD_ENABLE EV_FEATURE_WATCHERS
+#  define EV_CHILD_ENABLE 0
 #endif
 #endif
[jpeach@jpeach 3rdparty]$ grep -r EV_CHILD_ENABLE /usr/include/
/usr/include/ev.h:#ifndef EV_CHILD_ENABLE
/usr/include/ev.h:#  define EV_CHILD_ENABLE 0
/usr/include/ev.h:#  define EV_CHILD_ENABLE EV_FEATURE_WATCHERS
/usr/include/ev.h:#if EV_CHILD_ENABLE && !EV_SIGNAL_ENABLE
/usr/include/ev.h:# if EV_CHILD_ENABLE
/usr/include/ev++.h:  #if EV_CHILD_ENABLE
{noformat}


[jira] [Created] (MESOS-9212) Subprocess tests fail with libev 4.24

2018-09-06 Thread James Peach (JIRA)
James Peach created MESOS-9212:
--

 Summary: Subprocess tests fail with libev 4.24
 Key: MESOS-9212
 URL: https://issues.apache.org/jira/browse/MESOS-9212
 Project: Mesos
  Issue Type: Bug
Reporter: James Peach


On Fedora 28, building against the system version of libev (version 4.24) 
causes the following tests to fail:
{noformat}
[  FAILED  ] ReapTest.NonChildProcess
[  FAILED  ] ReapTest.ChildProcess
[  FAILED  ] ReapTest.TerminatedChildProcess
[  FAILED  ] SubprocessTest.PipeOutputToFileDescriptor
[  FAILED  ] SubprocessTest.PipeOutputToPath
[  FAILED  ] SubprocessTest.EnvironmentEcho
[  FAILED  ] SubprocessTest.Status
[  FAILED  ] SubprocessTest.PipeOutput
[  FAILED  ] SubprocessTest.PipeLargeOutput
[  FAILED  ] SubprocessTest.PipeInput
[  FAILED  ] SubprocessTest.PipeRedirect
[  FAILED  ] SubprocessTest.PathOutput
[  FAILED  ] SubprocessTest.PathInput
[  FAILED  ] SubprocessTest.FdOutput
[  FAILED  ] SubprocessTest.FdInput
[  FAILED  ] SubprocessTest.Default
[  FAILED  ] SubprocessTest.Flags
[  FAILED  ] SubprocessTest.Environment
[  FAILED  ] SubprocessTest.EnvironmentWithSpaces
[  FAILED  ] SubprocessTest.EnvironmentWithSpacesAndQuotes
[  FAILED  ] SubprocessTest.EnvironmentOverride
{noformat}

This build configuration succeeds:
{noformat}
$ ../configure --disable-java --disable-python --enable-silent-rules 
--disable-hardening --disable-werror --disable-libtool-wrappers 
--enable-xfs-disk-isolator --enable-install-module-dependencies 
--enable-port-mapping-isolator --enable-network-ports-isolator 
--with-protobuf=/usr --with-curl=/usr --with-libarchive=/usr 
--with-zookeeper=/usr --prefix=/opt/mesos "CXXFLAGS=-O0 -ggdb3 
-fno-omit-frame-pointer -fvisibility-inlines-hidden -Wno-unused-local-typedefs 
-Wno-deprecated" "CFLAGS=-O0 -ggdb3 -fno-omit-frame-pointer 
-Wno-unused-local-typedefs -Wno-deprecated" LDFLAGS= 
CXX=/home/jpeach/src/asf-mesos/build/c++ CC=/home/jpeach/src/asf-mesos/build/cc 
LD=/home/jpeach/src/asf-mesos/build/ld
{noformat}

This build configuration fails:

{noformat}
  $ ../configure --disable-java --disable-python --enable-silent-rules 
--disable-hardening --disable-werror --disable-libtool-wrappers 
--enable-xfs-disk-isolator --enable-install-module-dependencies 
--enable-port-mapping-isolator --enable-network-ports-isolator 
--with-protobuf=/usr --with-curl=/usr --with-libarchive=/usr 
--with-zookeeper=/usr --prefix=/opt/mesos "CXXFLAGS=-O0 -ggdb3 
-fno-omit-frame-pointer -fvisibility-inlines-hidden -Wno-unused-local-typedefs 
-Wno-deprecated" "CFLAGS=-O0 -ggdb3 -fno-omit-frame-pointer 
-Wno-unused-local-typedefs -Wno-deprecated" LDFLAGS= 
CXX=/home/jpeach/src/asf-mesos/build/c++ CC=/home/jpeach/src/asf-mesos/build/cc 
LD=/home/jpeach/src/asf-mesos/build/ld --with-libev=/usr
{noformat}

I think what happens here is that the child process gets reaped wrongly somehow:
{noformat}
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from SubprocessTest
[ RUN  ] SubprocessTest.EnvironmentWithSpaces
[pid 25909] clone(child_stack=NULL, 
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
child_tidptr=0x7fa11881fcd0) = 25923
strace: Process 25923 attached
[pid 25923] execve("/usr/bin/sh", ["sh", "-c", "echo $MESSAGE"], 0x1ff3950 /* 1 
var */) = 0
[pid 25923] arch_prctl(ARCH_SET_FS, 0x7f24561c5740) = 0
[pid 25923] exit_group(0)   = ?
[pid 25923] +++ exited with 0 +++
[pid 25909] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=25923, 
si_uid=9306, si_status=0, si_utime=0, si_stime=0} ---
[pid 25922] wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 
WNOHANG|WSTOPPED|WCONTINUED, NULL) = 25923
[pid 25922] wait4(-1, 0x7fa10a74da44, WNOHANG|WSTOPPED|WCONTINUED, NULL) = -1 
ECHILD (No child processes)
[pid 25919] wait4(25923, 0x7fa10bf50548, WNOHANG, NULL) = -1 ECHILD (No child 
processes)
../../../3rdparty/libprocess/src/tests/subprocess_tests.cpp:977: Failure
(s->status()).get() is NONE
[  FAILED  ] SubprocessTest.EnvironmentWithSpaces (12 ms)
[--] 1 test from SubprocessTest (12 ms total)

[--] Global test environment tear-down
[==] 1 test from 1 test case ran. (12 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SubprocessTest.EnvironmentWithSpaces
{noformat}
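
The pattern in the trace can be reproduced in isolation. The sketch below is a minimal POSIX illustration, not Mesos or libev code: a wildcard wait (standing in for the wait4(-1, ...) call made by pid 25922 above) reaps the child first, so a later wait on that specific pid fails with ECHILD. The function name is made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the errno seen by a targeted waitpid() after a wildcard wait has
   already reaped the child, 0 if the targeted wait did not fail, or -1 if
   fork() itself failed. Assumes the caller has no other children. */
int errno_after_wildcard_reap(void)
{
  pid_t pid = fork();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
    _exit(0); /* Child: exit immediately. */
  }

  /* Stand-in for a generic SIGCHLD handler that waits for any child. */
  int status = 0;
  pid_t reaped = waitpid(-1, &status, 0);
  assert(reaped == pid);

  /* The code that actually owns the child now finds nothing to collect. */
  errno = 0;
  if (waitpid(pid, &status, WNOHANG) == -1) {
    return errno;
  }
  return 0;
}
```

Whichever waiter runs first collects the exit status, which would be consistent with the test seeing NONE for the subprocess status.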





[jira] [Comment Edited] (MESOS-9172) Fetcher deadlock with duplicated URIs.

2018-08-31 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598998#comment-16598998
 ] 

James Peach edited comment on MESOS-9172 at 8/31/18 4:46 PM:
-

| [r/68587|https://reviews.apache.org/r/68587] | Fixed fetcher deadlock with 
duplicate URIs. |
| [r/68586|https://reviews.apache.org/r/68586] | Add the output file to the 
hash on CommandInfo::URI. |


was (Author: jamespeach):
| [r/68587|https://reviews.apache.org/*r/68587] | Fixed fetcher deadlock with 
duplicate URIs. |
| [r/68586|https://reviews.apache.org/*r/68586] | Add the output file to the 
hash on CommandInfo::URI. |

> Fetcher deadlock with duplicated URIs.
> --
>
> Key: MESOS-9172
> URL: https://issues.apache.org/jira/browse/MESOS-9172
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> If the fetcher cache is empty and you launch a task that contains duplicate 
> URIs, the fetcher deadlocks waiting for the futures in 
> {{FetcherProcess::_fetch}}.
> What happens is that when the fetcher is setting up the initial match of 
> cache lookup futures in {{FetcherProcess::fetch}}, the duplicate URIs cause 
> cache hits on the placeholder cache entries. This code assumes that there 
> is already an operation in flight that will populate the cache entry. 
> However, the cache is currently empty; the placeholder entry is caused by 
> the duplicate in the task's URIs.
> When we await the futures in {{FetcherProcess::_fetch}}, we end up waiting 
> for the future that indicates the cache entry has been populated, but that 
> won't ever happen because we need to make progress on the current fetching 
> batch in order to populate the cache entry. At this point we are deadlocked.





[jira] [Commented] (MESOS-9192) Mesos build fail on Ubuntu 14.04.

2018-08-29 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596866#comment-16596866
 ] 

James Peach commented on MESOS-9192:


Per [the docs|http://mesos.apache.org/documentation/latest/building/] we 
require clang >= 3.5. Maybe we ought to add a version check to the build like 
we did for GCC?
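
As a rough illustration of what such a check could enforce, here is a sketch that mirrors the documented clang >= 3.5 requirement. It relies on clang's predefined __clang_major__ and __clang_minor__ macros; the helper function name is hypothetical, and the existing GCC check in the Mesos build may look quite different.

```c
#include <assert.h>

/* Returns 1 if major.minor is at least req_major.req_minor, else 0. */
int version_at_least(int major, int minor, int req_major, int req_minor)
{
  assert(major >= 0 && minor >= 0);
  return major > req_major || (major == req_major && minor >= req_minor);
}

/* Compile-time guard: reject clang older than 3.5, ignore other compilers. */
#if defined(__clang__)
# if __clang_major__ < 3 || (__clang_major__ == 3 && __clang_minor__ < 5)
#  error "clang >= 3.5 is required"
# endif
#endif
```

With a guard like this, clang 3.4 on Ubuntu 14.04 would be rejected up front with a clear message instead of failing later in the bundled gRPC build.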

> Mesos build fail on Ubuntu 14.04.
> -
>
> Key: MESOS-9192
> URL: https://issues.apache.org/jira/browse/MESOS-9192
> Project: Mesos
>  Issue Type: Bug
>Reporter: Meng Zhu
>Priority: Major
>
> Ubuntu 14.04, clang3.4
> If I manually install protobuf-compiler, the build will pass.
> {noformat}
> make[3]: Entering directory 
> `/home/mengzhu/workspace/mesos_current/build/3rdparty'
> cd grpc-1.10.0 &&   \
>   
> CPPFLAGS="-I/home/mengzhu/workspace/mesos_current/build/3rdparty/protobuf-3.5.0/src
>\
> \
> \
> -Wno-array-bounds   \
> -I/usr/include/subversion-1 -I/usr/include/apr-1 
> -I/usr/include/apr-1.0   " \
>   CFLAGS="-g1 -O0"  \
>   CXXFLAGS="-g1 -O0 -Wno-inconsistent-missing-override -std=c++11"
>   \
>   make  \
> 
> /home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc++_unsecure.a
>  
> /home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc_unsecure.a
>  
> /home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0/libs/opt/libgpr.a
> \
> CC="clang"  \
> CXX="clang++"   \
> LD="clang"  \
> LDXX="clang++"  \
> 
> LDFLAGS="-L/home/mengzhu/workspace/mesos_current/build/3rdparty/protobuf-3.5.0/src/.libs
> \
> \
> \
>  "  \
> LDLIBS=""   \
> HAS_PKG_CONFIG=false\
> NO_PROTOC=false \
> 
> PROTOC="/home/mengzhu/workspace/mesos_current/build/3rdparty/protobuf-3.5.0/src/protoc"
> make[4]: Entering directory 
> `/home/mengzhu/workspace/mesos_current/build/3rdparty/grpc-1.10.0'
> DEPENDENCY ERROR
> The target you are trying to run requires protobuf 3.0.0+
> Your system doesn't have it, and neither does the third_party directory.
> Please consult INSTALL to get more information.
> If you need information about why these tests failed, run:
>   make run_dep_checks
> make[4]: *** [stop] Error 1
> {noformat}





[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-08-22 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589159#comment-16589159
 ] 

James Peach commented on MESOS-9178:


/cc [~bmahler]



[jira] [Created] (MESOS-9175) `Subprocess::FD` can leak file descriptors into child processes

2018-08-21 Thread James Peach (JIRA)
James Peach created MESOS-9175:
--

 Summary: `Subprocess::FD` can leak file descriptors into child 
processes
 Key: MESOS-9175
 URL: https://issues.apache.org/jira/browse/MESOS-9175
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: James Peach


When you use the {{subprocess}} API, you can use {{Subprocess::FD()}} to define 
how the standard IO streams are attached to the child process. The default type 
argument is {{IO::DUPLICATED}}. In that case, the descriptors are duplicated 
with {{dup(2)}} in the parent process. The new file descriptors will have their 
close-on-exec flag cleared and could then be inherited by unrelated child 
processes.
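
The difference is easy to see with plain POSIX calls. The sketch below is not libprocess code: it contrasts dup(2) with fcntl(F_DUPFD_CLOEXEC), which duplicates a descriptor with close-on-exec already set. Whether that is the right fix for {{Subprocess::FD}} is an assumption, and the function names are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L /* For F_DUPFD_CLOEXEC. */

#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Plain dup(2): the copy always has FD_CLOEXEC cleared, so any process that
   is exec'd while it is open inherits it. */
int dup_inheritable(int fd)
{
  return dup(fd);
}

/* Duplicate with FD_CLOEXEC set atomically, so there is no window in which
   a concurrent fork and exec could inherit the copy. */
int dup_cloexec(int fd)
{
  return fcntl(fd, F_DUPFD_CLOEXEC, 0);
}

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if it is clear, -1 on error. */
int has_cloexec(int fd)
{
  assert(fd >= 0);
  int flags = fcntl(fd, F_GETFD);
  if (flags == -1) {
    return -1;
  }
  return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

A descriptor duplicated this way would still need close-on-exec cleared in the intended child, after fork and before exec.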





[jira] [Created] (MESOS-9172) Fetcher deadlock with duplicated URIs.

2018-08-21 Thread James Peach (JIRA)
James Peach created MESOS-9172:
--

 Summary: Fetcher deadlock with duplicated URIs.
 Key: MESOS-9172
 URL: https://issues.apache.org/jira/browse/MESOS-9172
 Project: Mesos
  Issue Type: Bug
  Components: fetcher
Reporter: James Peach
Assignee: James Peach


If the fetcher cache is empty and you launch a task that contains duplicate 
URIs, the fetcher deadlocks waiting for the futures in 
{{FetcherProcess::_fetch}}.

What happens is that when the fetcher is setting up the initial match of cache 
lookup futures in {{FetcherProcess::fetch}}, the duplicate URIs cause cache 
hits on the placeholder cache entries. This code assumes that there is 
already an operation in flight that will populate the cache entry. However, the 
cache is currently empty; the placeholder entry is caused by the duplicate 
in the task's URIs.

When we await the futures in {{FetcherProcess::_fetch}}, we end up waiting for 
the future that indicates the cache entry has been populated, but that won't 
ever happen because we need to make progress on the current fetching batch in 
order to populate the cache entry. At this point we are deadlocked.
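
One way to avoid a batch waiting on its own placeholder is to collapse duplicate URIs before any cache lookups are set up. The sketch below only illustrates that idea on plain strings; it is not necessarily the approach taken in the actual fix, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compacts `uris` in place so that the first occurrence of each URI is kept,
   in the original order, and returns the number of unique URIs. Quadratic,
   which is fine for the handful of URIs a task typically carries. */
size_t dedup_uris(const char **uris, size_t count)
{
  assert(uris != NULL || count == 0);

  size_t unique = 0;
  for (size_t i = 0; i < count; i++) {
    int seen = 0;
    for (size_t j = 0; j < unique; j++) {
      if (strcmp(uris[i], uris[j]) == 0) {
        seen = 1;
        break;
      }
    }
    if (!seen) {
      uris[unique++] = uris[i];
    }
  }
  return unique;
}
```

With duplicates removed, each placeholder entry in a batch has exactly one fetch operation that will populate it.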





[jira] [Assigned] (MESOS-9164) Subprocess should unset CLOEXEC on whitelisted file descriptors

2018-08-17 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9164:
--

Assignee: James Peach

> Subprocess should unset CLOEXEC on whitelisted file descriptors
> ---
>
> Key: MESOS-9164
> URL: https://issues.apache.org/jira/browse/MESOS-9164
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> The libprocess subprocess API accepts a set of whitelisted file descriptors 
> that are supposed to be inherited by the child process. On Windows, these 
> are used, but otherwise the subprocess API just ignores them. We probably 
> should make sure that the API clears the {{CLOEXEC}} flag on these descriptors 
> so that they are inherited by the child.





[jira] [Created] (MESOS-9164) Subprocess should unset CLOEXEC on whitelisted file descriptors

2018-08-17 Thread James Peach (JIRA)
James Peach created MESOS-9164:
--

 Summary: Subprocess should unset CLOEXEC on whitelisted file 
descriptors
 Key: MESOS-9164
 URL: https://issues.apache.org/jira/browse/MESOS-9164
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: James Peach


The libprocess subprocess API accepts a set of whitelisted file descriptors 
that are supposed to be inherited by the child process. On Windows, these are 
used, but otherwise the subprocess API just ignores them. We probably should 
make sure that the API clears the {{CLOEXEC}} flag on these descriptors so that 
they are inherited by the child.
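
Clearing the flag on a POSIX system comes down to an {{fcntl(2)}} read-modify-write of the descriptor flags. The sketch below shows that step in isolation. The function name {{clearCloexec}} is hypothetical and is not part of the libprocess API.

```cpp
#include <cassert>
#include <fcntl.h>

// Hypothetical helper: clears FD_CLOEXEC on `fd` while leaving any
// other descriptor flags untouched. Returns 0 on success and -1 on
// error, with errno set by fcntl(2).
int clearCloexec(int fd)
{
  assert(fd >= 0);

  int flags = ::fcntl(fd, F_GETFD);
  if (flags == -1) {
    return -1;
  }

  return ::fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}
```

Where this runs matters. If it were called for each whitelisted descriptor in the child between {{fork}} and {{exec}}, the parent's copies would keep close-on-exec set and would not leak into other children. That placement is a suggestion here, not something the ticket specifies.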





[jira] [Created] (MESOS-9161) Build bundled ZK with SOCK_CLOEXEC

2018-08-16 Thread James Peach (JIRA)
James Peach created MESOS-9161:
--

 Summary: Build bundled ZK with SOCK_CLOEXEC
 Key: MESOS-9161
 URL: https://issues.apache.org/jira/browse/MESOS-9161
 Project: Mesos
  Issue Type: Bug
  Components: build
 Environment: We should enable {{\--with-sock-cloexec}} in our bundled 
ZooKeeper client build to enable the fix for ZOOKEEPER-2338 (which opens 
sockets with the {{SOCK_CLOEXEC}} flag).
Reporter: James Peach








[jira] [Commented] (MESOS-5158) Provide XFS quota support for persistent volumes.

2018-08-09 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575181#comment-16575181
 ] 

James Peach commented on MESOS-5158:


Working on this now.

> Provide XFS quota support for persistent volumes.
> -
>
> Key: MESOS-5158
> URL: https://issues.apache.org/jira/browse/MESOS-5158
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Yan Xu
>Assignee: James Peach
>Priority: Major
>
> Given that the lifecycle of persistent volumes is managed outside of the 
> isolator, we may need to further abstract out the quota management 
> functionality to do it outside the XFS isolator.





[jira] [Created] (MESOS-9138) Crashes in ProcessTest.Process_BENCHMARK_DispatchDefer

2018-08-06 Thread James Peach (JIRA)
James Peach created MESOS-9138:
--

 Summary: Crashes in ProcessTest.Process_BENCHMARK_DispatchDefer
 Key: MESOS-9138
 URL: https://issues.apache.org/jira/browse/MESOS-9138
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: James Peach


The `ProcessTest.Process_BENCHMARK_DispatchDefer` benchmark crashes fairly 
regularly (though not deterministically).

{noformat}
[ RUN  ] ProcessTest.Process_BENCHMARK_DispatchDefer
Movable elapsed: 12.65446863100secs
../../../3rdparty/libprocess/src/tests/benchmarks.cpp:572: Failure
Failed to wait 15secs for promise.future()
benchmarks: ../../../3rdparty/libprocess/include/process/dispatch.hpp:354: auto 
process::dispatch(const PID &, Future 
(DispatchProcess::*)(const DispatchProcess::Copyable &), const 
DispatchProcess::Copyable &&)::(anonymous 
class)::operator()(std::unique_ptr >, typename std::decay::type 
&&, process::ProcessBase *) const: Assertion `t != nullptr' failed.
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0806 15:16:43.668474 28956 process.cpp:3419] Check failed: state.load() == 
ProcessBase::State::BOTTOM || state.load() == ProcessBase::State::TERMINATING
*** Aborted at 1533593803 (unix time) try "date -d @1533593803" if you are 
using GNU date ***
*** Check failure stack trace: ***
PC: @ 0x7f24f4327feb __GI_raise
*** SIGABRT (@0x245a711c) received by PID 28956 (TID 0x7f24eda65700) from 
PID 28956; stack trace: ***
@ 0x7f24f540bfc0 (unknown)
@ 0x7f24f4327feb __GI_raise
@ 0x7f24f43125c1 __GI_abort
@ 0x7f24f4312491 __assert_fail_base.cold.0
@ 0x7f24f4320752 __GI___assert_fail
@   0x4a8988 
_ZZN7process8dispatchI7Nothing15DispatchProcessRKNS2_8CopyableES5_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSA_FS8_T1_EOT2_ENKUlSt10unique_ptrINS_7PromiseIS1_EESt14default_deleteISL_EEOS3_PNS_11ProcessBaseEE_clESO_SP_SR_
@   0x4a879b 
_ZN5cpp176invokeIZN7process8dispatchI7Nothing15DispatchProcessRKNS4_8CopyableES7_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSC_FSA_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS3_EESt14default_deleteISN_EEOS5_PNS1_11ProcessBaseEE_JSQ_S5_ST_EEEDTclclsr3stdE7forwardIS9_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOS9_DpOSV_
@   0x4a871b 
_ZN6lambda8internal7PartialIZN7process8dispatchI7Nothing15DispatchProcessRKNS5_8CopyableES8_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSD_FSB_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS4_EESt14default_deleteISO_EEOS6_PNS2_11ProcessBaseEE_JSR_S6_St12_PlaceholderILi113invoke_expandISV_St5tupleIJSR_S6_SX_EES10_IJOSU_EEJLm0ELm1ELm2DTclsr5cpp17E6invokeclsr3stdE7forwardISA_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardISD_Efp0_EEclsr3stdE7forwardISH_Efp2_OSA_OSD_N5cpp1416integer_sequenceImJXspT2_OSH_
@   0x4a864e 
_ZNO6lambda8internal7PartialIZN7process8dispatchI7Nothing15DispatchProcessRKNS5_8CopyableES8_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSD_FSB_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS4_EESt14default_deleteISO_EEOS6_PNS2_11ProcessBaseEE_JSR_S6_St12_PlaceholderILi1clIJSU_EEEDTcl13invoke_expandclL_ZSt4moveIRSV_EONSt16remove_referenceISA_E4typeEOSA_EdtdefpT1fEclL_ZS10_IRSt5tupleIJSR_S6_SX_EEES15_S16_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOS1D_
@   0x4a85e2 
_ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchI7Nothing15DispatchProcessRKNS7_8CopyableESA_EENS4_6FutureIT_EERKNS4_3PIDIT0_EEMSF_FSD_T1_EOT2_EUlSt10unique_ptrINS4_7PromiseIS6_EESt14default_deleteISQ_EEOS8_PNS4_11ProcessBaseEE_JST_S8_St12_PlaceholderILi1EJSW_EEEDTclclsr3stdE7forwardISC_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSC_DpOS11_
@   0x4a85a6 
_ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchI7Nothing15DispatchProcessRKNS8_8CopyableESB_EENS5_6FutureIT_EERKNS5_3PIDIT0_EEMSG_FSE_T1_EOT2_EUlSt10unique_ptrINS5_7PromiseIS7_EESt14default_deleteISR_EEOS9_PNS5_11ProcessBaseEE_JSU_S9_St12_PlaceholderILi1EJSX_EEEvOSD_DpOT0_
@   0x4a855d 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI7Nothing15DispatchProcessRKNSB_8CopyableESE_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSJ_FSH_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteISU_EEOSC_S3_E_JSX_SC_St12_PlaceholderILi1EEclEOS3_
@   0x721f58 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
@   0x721e19 process::ProcessBase::consume()
@   0x780169 
_ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
@   0x41d7f4 process::ProcessBase::serve()
@   0x71d315 process::ProcessManager::resume()
@   0x7d8d8e 
process::ProcessManager::init_threads()::$_8::operator()()
@   0x7d8c4d 

[jira] [Created] (MESOS-9137) GRPC build fails to pass compiler flags

2018-08-06 Thread James Peach (JIRA)
James Peach created MESOS-9137:
--

 Summary: GRPC build fails to pass compiler flags
 Key: MESOS-9137
 URL: https://issues.apache.org/jira/browse/MESOS-9137
 Project: Mesos
  Issue Type: Bug
  Components: build
Reporter: James Peach


The GRPC build integration fails to pass compiler flags down from the main 
build into the GRPC component build. This can make the build fail in surprising 
ways.

For example, if you use {{CXXFLAGS="-fsanitize=thread" 
CFLAGS="-fsanitize=thread"}}, the build fails because of the inconsistent 
application of these flags across bundled components.

In this build log, libprotobuf was built using the correct flags, which then 
causes GRPC to fail because it is missing the flags:

{noformat}
make[3]: Entering directory '/home/jpeach/src/asf-mesos/build/3rdparty'
cd grpc-1.10.0 && \
CPPFLAGS="-I/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src \
\
\
-Wno-array-bounds" \
make \
/home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc++.a /home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc.a /home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/libs/opt/libgpr.a \
CC="/home/jpeach/src/asf-mesos/build/cc" \
CXX="/home/jpeach/src/asf-mesos/build/c++" \
LD="/home/jpeach/src/asf-mesos/build/cc" \
LDXX="/home/jpeach/src/asf-mesos/build/c++" \
LDFLAGS="-L/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/.libs \
\
" \
HAS_PKG_CONFIG=false \
NO_PROTOC=false \
PROTOC="/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/protoc"
make[4]: Entering directory '/home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0'
mkdir -p `dirname /home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/bins/opt/grpc_cpp_plugin`
/home/jpeach/src/asf-mesos/build/c++ -L/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/.libs /home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/objs/opt/src/compiler/cpp_plugin.o /home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/libs/opt/libgrpc_plugin_support.a -lprotoc -lprotobuf -ldl -lrt -lm -lpthread -lz -lprotoc -lprotobuf -o /home/jpeach/src/asf-mesos/build/3rdparty/grpc-1.10.0/bins/opt/grpc_cpp_plugin
/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/.libs/libprotoc.a(code_generator.o): In function `__cxx_global_var_init':
code_generator.cc:(.text.startup+0xd): undefined reference to `__tsan_func_entry'
code_generator.cc:(.text.startup+0x43): undefined reference to `__tsan_func_exit'
code_generator.cc:(.text.startup+0x57): undefined reference to `__tsan_func_exit'
/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/.libs/libprotoc.a(code_generator.o): In function `_GLOBAL__sub_I_code_generator.cc':
code_generator.cc:(.text.startup+0x7d): undefined reference to `__tsan_func_entry'
code_generator.cc:(.text.startup+0x8c): undefined reference to `__tsan_func_exit'
code_generator.cc:(.text.startup+0xa0): undefined reference to `__tsan_func_exit'
/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/.libs/libprotoc.a(code_generator.o): In function `google::protobuf::compiler::CodeGenerator::~CodeGenerator()':
code_generator.cc:(.text._ZN6google8protobuf8compiler13CodeGeneratorD0Ev+0x14): undefined reference to `__tsan_func_entry'
/home/jpeach/src/asf-mesos/build/3rdparty/protobuf-3.5.0/src/.libs/libprotoc.a(code_generator.o): In function `google::protobuf::compiler::CodeGenerator::GenerateAll(std::vector > const&, std::__cxx11::basic_string, std::allocator > const&, google::protobuf::compiler::GeneratorContext*, std::__cxx11::basic_string, std::allocator >*) const':
{noformat}






[jira] [Commented] (MESOS-9115) Stout depends on missing rapidjson headers.

2018-07-27 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559953#comment-16559953
 ] 

James Peach commented on MESOS-9115:


Summoning [~bmahler]

> Stout depends on missing rapidjson headers.
> ---
>
> Key: MESOS-9115
> URL: https://issues.apache.org/jira/browse/MESOS-9115
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: James Peach
>Priority: Major
>
> Stout depends on {{}} and {{}}, 
> and these eventually depend on files in {{}}. When we 
> install Mesos, we aren't installing the rapidjson internal headers, which 
> breaks the build for external Mesos modules.
> {noformat}
> 05:54:07 - In file included from /usr/include/stout/jsonify.hpp:36:0,
> 05:54:07 -  from /usr/include/stout/json.hpp:41,
> 05:54:07 -  from /usr/include/mesos/resources.hpp:37,
> 05:54:07 -  from /usr/include/mesos/slave/isolator.hpp:23,
> 05:54:07 -  from /usr/include/mesos/module/isolator.hpp:23,
> 05:54:07 -  from src/isolator.cc:8:
> 05:54:07 - /usr/include/rapidjson/stringbuffer.h:19:28: fatal error: 
> internal/stack.h: No such file or directory
> 05:54:07 -  #include "internal/stack.h"
> {noformat}





[jira] [Created] (MESOS-9115) Stout depends on missing rapidjson headers.

2018-07-27 Thread James Peach (JIRA)
James Peach created MESOS-9115:
--

 Summary: Stout depends on missing rapidjson headers.
 Key: MESOS-9115
 URL: https://issues.apache.org/jira/browse/MESOS-9115
 Project: Mesos
  Issue Type: Bug
  Components: build
Reporter: James Peach


Stout depends on {{}} and {{}}, 
and these eventually depend on files in {{}}. When we 
install Mesos, we aren't installing the rapidjson internal headers, which 
breaks the build for external Mesos modules.

{noformat}
05:54:07 - In file included from /usr/include/stout/jsonify.hpp:36:0,
05:54:07 -  from /usr/include/stout/json.hpp:41,
05:54:07 -  from /usr/include/mesos/resources.hpp:37,
05:54:07 -  from /usr/include/mesos/slave/isolator.hpp:23,
05:54:07 -  from /usr/include/mesos/module/isolator.hpp:23,
05:54:07 -  from src/isolator.cc:8:
05:54:07 - /usr/include/rapidjson/stringbuffer.h:19:28: fatal error: 
internal/stack.h: No such file or directory
05:54:07 -  #include "internal/stack.h"
{noformat}





[jira] [Comment Edited] (MESOS-9065) Apply the `override` keyword globally.

2018-07-09 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537947#comment-16537947
 ] 

James Peach edited comment on MESOS-9065 at 7/10/18 3:38 AM:
-

|[https://reviews.apache.org/r/67866/] | Apply the `override` keyword to stout.|
|[https://reviews.apache.org/r/67867/] | Apply the `override` keyword to 
libprocess.|
|[https://reviews.apache.org/r/67868/] |Apply the `override` keyword to Mesos. |
|[https://reviews.apache.org/r/67869/] |Add use of `override` to the Mesos C++ 
style guide. |



was (Author: jamespeach):
|https://reviews.apache.org/r/67866/ | Apply the `override` keyword to stout.|
|https://reviews.apache.org/r/67867/ | Apply the `override` keyword to 
libprocess.|
|https://reviews.apache.org/r/67868/ |Apply the `override` keyword to Mesos. |
|https://reviews.apache.org/r/67869/ |Add use of `override` to the Mesos C++ 
style guide. |


> Apply the `override` keyword globally.
> --
>
> Key: MESOS-9065
> URL: https://issues.apache.org/jira/browse/MESOS-9065
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> As per [this 
> thread|https://lists.apache.org/thread.html/371c23ca743dbc354fcf440d1fa9e99c29f20602c5efd7dc563713a9@%3Cdev.mesos.apache.org%3E],
>  apply the {{override}} keyword globally.





[jira] [Created] (MESOS-9065) Apply the `override` keyword globally.

2018-07-09 Thread James Peach (JIRA)
James Peach created MESOS-9065:
--

 Summary: Apply the `override` keyword globally.
 Key: MESOS-9065
 URL: https://issues.apache.org/jira/browse/MESOS-9065
 Project: Mesos
  Issue Type: Bug
Reporter: James Peach
Assignee: James Peach


As per [this 
thread|https://lists.apache.org/thread.html/371c23ca743dbc354fcf440d1fa9e99c29f20602c5efd7dc563713a9@%3Cdev.mesos.apache.org%3E],
 apply the {{override}} keyword globally.
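
As a reminder of what the keyword buys, here is a small self-contained sketch. The class names {{Base}} and {{Derived}} are placeholders and do not refer to any Mesos classes.

```cpp
#include <string>

struct Base
{
  virtual ~Base() = default;
  virtual std::string name() const { return "base"; }
};

struct Derived : Base
{
  // `override` asks the compiler to check that this declaration really
  // overrides a virtual function in a base class. If the `const` were
  // dropped here, compilation would fail instead of silently declaring
  // a new, unrelated member function.
  std::string name() const override { return "derived"; }
};

// Calls through the base interface, so virtual dispatch selects the
// most derived override.
std::string describe(const Base& object)
{
  return object.name();
}
```

Without {{override}}, a signature mismatch in {{Derived::name}} would still compile, and {{describe}} would return "base" for a {{Derived}} object.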





[jira] [Created] (MESOS-9057) Add a cmake option to disable -Werror.

2018-07-08 Thread James Peach (JIRA)
James Peach created MESOS-9057:
--

 Summary: Add a cmake option to disable -Werror.
 Key: MESOS-9057
 URL: https://issues.apache.org/jira/browse/MESOS-9057
 Project: Mesos
  Issue Type: Bug
  Components: build, cmake
Reporter: James Peach


The autotools build has a {{\-\-disable-werror}} build option that disables the 
{{-Werror}} compile flag in Mesos and its dependencies. We need to do the same 
for cmake so that this doesn't block upgrading compilers or other dependencies.





[jira] [Assigned] (MESOS-9051) Move agent call validation into common validation library.

2018-07-05 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9051:
--

Assignee: James Peach

| [https://reviews.apache.org/r/67830/] | Moved `executor::Call` validation to 
common validation library. |

> Move agent call validation into common validation library.
> --
>
> Key: MESOS-9051
> URL: https://issues.apache.org/jira/browse/MESOS-9051
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, build
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> The executor driver calls {{executor::call::validate()}} from 
> {{src/slave/validation.cpp}}, which creates an upward dependency from 
> libmesos.so (where the executor driver has to live) to the agent. If we can 
> move the validation calls down to the common validation library, we can break 
> this dependency.





[jira] [Commented] (MESOS-9040) Break scheduler driver dependency on mesos-local.

2018-07-04 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533262#comment-16533262
 ] 

James Peach commented on MESOS-9040:


[~benjaminhindman] [~tillt] What do you think about just removing this 
feature? It's not documented and I don't know of anyone who uses it (no-one on 
the dev list responded, though we should try harder to let people know if we 
are going to remove it). With the advent of the HTTP API, maybe there are fewer 
users of the scheduler drivers, so this is less likely to benefit framework 
developers. I can also take an action to add some docs about integration 
testing with {{mesos-local}}.

> Break scheduler driver dependency on mesos-local.
> -
>
> Key: MESOS-9040
> URL: https://issues.apache.org/jira/browse/MESOS-9040
> Project: Mesos
>  Issue Type: Task
>  Components: build, scheduler driver
>Reporter: James Peach
>Priority: Minor
>
> The scheduler driver in {{src/sched/sched.cpp}} has some special dependencies 
> on the {{mesos-local}} code. This seems fairly hacky, but it also causes 
> binary dependencies on {{src/local/local.cpp}} to be dragged into 
> {{libmesos.so}}. {{libmesos.so}} would not otherwise require this code, which 
> could be isolated in the {{mesos-local}} command.





[jira] [Commented] (MESOS-9041) Break agent dependencies out of libmesos.

2018-07-04 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532713#comment-16532713
 ] 

James Peach commented on MESOS-9041:


I got a rough prototype working and it does improve build times a little. I 
tested on a local VM (4CPU, 8G RAM), with a fully populated cache and {{make 
-j4}}.

Unmodified build:
{noformat}
real    9m23.702s
user    8m13.996s
sys     3m32.028s
{noformat}

Agent dependencies broken into libmesos-agent.so:
{noformat}
real8m4.517s
user7m23.865s
sys 3m47.629s
{noformat}

So this looks like a nice improvement in at least one configuration.

> Break agent dependencies out of libmesos.
> -
>
> Key: MESOS-9041
> URL: https://issues.apache.org/jira/browse/MESOS-9041
> Project: Mesos
>  Issue Type: Task
>  Components: agent, build
>Reporter: James Peach
>Priority: Major
>
> {{libmesos.so}} includes all the dependencies for both the master and the 
> agent. This means that it has way more symbols than necessary (causing 
> inflated build times), and drags in dependencies (e.g. libnl.so, libblkid.so) 
> that are only necessary on the agent. We should attempt to separate the agent 
> code out of {{libmesos.so}}, which would improve the build cleanliness and 
> hopefully performance.





[jira] [Created] (MESOS-9051) Move agent call validation into common validation library.

2018-07-04 Thread James Peach (JIRA)
James Peach created MESOS-9051:
--

 Summary: Move agent call validation into common validation library.
 Key: MESOS-9051
 URL: https://issues.apache.org/jira/browse/MESOS-9051
 Project: Mesos
  Issue Type: Bug
  Components: agent, build
Reporter: James Peach


The executor driver calls {{executor::call::validate()}} from 
{{src/slave/validation.cpp}}, which creates an upward dependency from 
libmesos.so (where the executor driver has to live) to the agent. If we can 
move the validation calls down to the common validation library, we can break 
this dependency.





[jira] [Comment Edited] (MESOS-9040) Break scheduler driver dependency on mesos-local.

2018-07-03 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530919#comment-16530919
 ] 

James Peach edited comment on MESOS-9040 at 7/3/18 7:25 AM:


{quote}
It is a convenience thing meant for framework developers - maybe we can achieve 
the same by exec'ing the mesos-local runnable if desired.
{quote}

Hmm, I never knew that. Our framework developers certainly don't know about it 
either. Do you know of anyone who does use it? Is there anything I can run to 
experiment with it?

If framework developers wanted to use {{mesos-local}}, why wouldn't they just 
exec the `mesos-local` process in their CI? 


was (Author: jamespeach):
{quote}
It is a convenience thing meant for framework developers - maybe we can achieve 
the same by exec'ing the mesos-local runnable if desired.
{quote}

Hmm, I never knew that. Our framework developers certainly don't know about it 
either. Do you know of anyone who does use it? Is there anything I can run to 
experiment with it?

If framework developers wanted to use {{mesos-local}}, why wouldn't they just 
exec the `mesos-local` process i their CI? 

> Break scheduler driver dependency on mesos-local.
> -
>
> Key: MESOS-9040
> URL: https://issues.apache.org/jira/browse/MESOS-9040
> Project: Mesos
>  Issue Type: Task
>  Components: build, scheduler driver
>Reporter: James Peach
>Priority: Minor
>
> The scheduler driver in {{src/sched/sched.cpp}} has some special dependencies 
> on the {{mesos-local}} code. This seems fairly hacky, but it also causes 
> binary dependencies on {{src/local/local.cpp}} to be dragged into 
> {{libmesos.so}}. {{libmesos.so}} would not otherwise require this code, which 
> could be isolated in the {{mesos-local}} command.





[jira] [Commented] (MESOS-9040) Break scheduler driver dependency on mesos-local.

2018-07-03 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530919#comment-16530919
 ] 

James Peach commented on MESOS-9040:


{quote}
It is a convenience thing meant for framework developers - maybe we can achieve 
the same by exec'ing the mesos-local runnable if desired.
{quote}

Hmm, I never knew that. Our framework developers certainly don't know about it 
either. Do you know of anyone who does use it? Is there anything I can run to 
experiment with it?

If framework developers wanted to use {{mesos-local}}, why wouldn't they just 
exec the `mesos-local` process in their CI? 

> Break scheduler driver dependency on mesos-local.
> -
>
> Key: MESOS-9040
> URL: https://issues.apache.org/jira/browse/MESOS-9040
> Project: Mesos
>  Issue Type: Task
>  Components: build, scheduler driver
>Reporter: James Peach
>Priority: Minor
>
> The scheduler driver in {{src/sched/sched.cpp}} has some special dependencies 
> on the {{mesos-local}} code. This seems fairly hacky, but it also causes 
> binary dependencies on {{src/local/local.cpp}} to be dragged into 
> {{libmesos.so}}. {{libmesos.so}} would not otherwise require this code, which 
> could be isolated in the {{mesos-local}} command.





[jira] [Comment Edited] (MESOS-9043) Move check validators to the common validation library.

2018-07-01 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529039#comment-16529039
 ] 

James Peach edited comment on MESOS-9043 at 7/1/18 10:07 AM:
-

|[r/67794|https://reviews.apache.org/r/67794/]|Moved `validation::healthCheck` 
to common code.|
|[r/67795|https://reviews.apache.org/r/67795/]|Moved `CheckInfo` validation to 
common code.|


was (Author: jamespeach):
|[r/67794|https://reviews.apache.org/r/67794/]|Moved `validation::healthCheck` 
to common code.|
|[/r/67795|https://reviews.apache.org/r/67795/]|Moved `CheckInfo` validation to 
common code.|

> Move check validators to the common validation library.
> ---
>
> Key: MESOS-9043
> URL: https://issues.apache.org/jira/browse/MESOS-9043
> Project: Mesos
>  Issue Type: Task
>  Components: build
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> The {{src/checks}} library contains some protobuf validation APIs that are 
> also used by the master. This creates a build dependency where the master 
> depends on the checks library but doesn't actually use the checks. We can 
> break this dependency by pushing the validators down into 
> {{src/common/validation.cpp}}.





[jira] [Assigned] (MESOS-9043) Move check validators to the common validation library.

2018-07-01 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9043:
--

Assignee: James Peach

> Move check validators to the common validation library.
> ---
>
> Key: MESOS-9043
> URL: https://issues.apache.org/jira/browse/MESOS-9043
> Project: Mesos
>  Issue Type: Task
>  Components: build
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> The {{src/checks}} library contains some protobuf validation APIs that are 
> also used by the master. This creates a build dependency where the master 
> depends on the checks library but doesn't actually use the checks. We can 
> break this dependency by pushing the validators down into 
> {{src/common/validation.cpp}}.





[jira] [Created] (MESOS-9043) Move check validators to the common validation library.

2018-06-30 Thread James Peach (JIRA)
James Peach created MESOS-9043:
--

 Summary: Move check validators to the common validation library.
 Key: MESOS-9043
 URL: https://issues.apache.org/jira/browse/MESOS-9043
 Project: Mesos
  Issue Type: Task
  Components: build
Reporter: James Peach


The {{src/checks}} library contains some protobuf validation APIs that are also 
used by the master. This creates a build dependency where the master depends on 
the checks library but doesn't actually use the checks. We can break this 
dependency by pushing the validators down into {{src/common/validation.cpp}}.




