[jira] [Comment Edited] (MESOS-8809) Add functions for manipulating POSIX ACLs into stout

2018-04-26 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454262#comment-16454262
 ] 

Qian Zhang edited comment on MESOS-8809 at 4/27/18 1:36 AM:


RR: https://reviews.apache.org/r/66840/


was (Author: qianzhang):
RR: https://reviews.apache.org/r/66811/

> Add functions for manipulating POSIX ACLs into stout
> 
>
> Key: MESOS-8809
> URL: https://issues.apache.org/jira/browse/MESOS-8809
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> We need to add functions for setting/getting POSIX ACLs into stout so that we 
> can leverage these functions to grant volume permissions to the specific task 
> user.
> This will introduce a new dependency {{libacl-devel}} when building Mesos.
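A minimal sketch of what such a helper might look like, assuming the standard libacl calls shipped with {{libacl-devel}} ({{acl_get_file}}, {{acl_create_entry}}, {{acl_add_perm}}, {{acl_set_file}}); the function name and error-handling convention below are illustrative only, not the proposed stout API:

{code}
// Illustrative sketch only -- not the proposed stout API. Grants
// read/write/execute on `path` to the user `uid` via a POSIX access ACL.
#include <sys/acl.h>
#include <sys/types.h>

#include <string>

bool addUserAcl(const std::string& path, uid_t uid)
{
  acl_t acl = acl_get_file(path.c_str(), ACL_TYPE_ACCESS);
  if (acl == nullptr) {
    return false;
  }

  // Append an ACL_USER entry for `uid` with rwx permissions.
  acl_entry_t entry;
  acl_permset_t permset;
  if (acl_create_entry(&acl, &entry) != 0 ||
      acl_set_tag_type(entry, ACL_USER) != 0 ||
      acl_set_qualifier(entry, &uid) != 0 ||
      acl_get_permset(entry, &permset) != 0 ||
      acl_add_perm(permset, ACL_READ) != 0 ||
      acl_add_perm(permset, ACL_WRITE) != 0 ||
      acl_add_perm(permset, ACL_EXECUTE) != 0) {
    acl_free(acl);
    return false;
  }

  // Recompute the mask entry and write the ACL back to the file.
  bool ok = acl_calc_mask(&acl) == 0 &&
            acl_set_file(path.c_str(), ACL_TYPE_ACCESS, acl) == 0;

  acl_free(acl);
  return ok;
}
{code}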



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8834) libprocess's internal::send and internal::_send call each other; when outgoing[socket] always has packets to send, the stack is exhausted and a core dump occurs

2018-04-26 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455602#comment-16455602
 ] 

Qian Zhang commented on MESOS-8834:
---

[~bennoe] You are right, it is the same as MESOS-8594, so I have marked this one 
as a duplicate.

[~general] Thanks for creating this ticket, please use English in JIRA so that 
others can better understand the issue :)

> libprocess's internal::send and internal::_send call each other; when 
> outgoing[socket] always has packets to send, the stack is exhausted and a 
> core dump occurs
> 
>
> Key: MESOS-8834
> URL: https://issues.apache.org/jira/browse/MESOS-8834
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.5.0
>Reporter: liwuqi
>Priority: Blocker
>  Labels: core, libprocess, send
>
> If a process sends messages in a while(true) loop, a large number of messages 
> will be buffered in outgoing[socket], while internal::send and internal::_send 
> perform the actual sending at the lower layer. This results in a recursive 
> call chain:
> _send -> send -> _send -> send -> ... -> _send -> send -> ...
> The call stack keeps growing until it is exhausted and a core dump occurs.
> In my local tests, the core dump occurred once the stack depth reached 40,000+.
> To fix this, the underlying message-sending mechanism needs to be changed.
>  
> Please pay attention to this issue, thank you.
>  
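To illustrate the reported pattern (a simplified sketch, not the actual libprocess code): two mutually recursive functions draining a queue add one stack frame per queued message, whereas an iterative drain keeps the stack bounded.

{code}
// Simplified illustration of the reported pattern -- not the actual
// libprocess code.
#include <queue>
#include <string>

std::queue<std::string> outgoing;

void _send(const std::string& data);

void send()
{
  if (outgoing.empty()) {
    return;
  }
  std::string next = outgoing.front();
  outgoing.pop();
  _send(next);  // send -> _send ...
}

void _send(const std::string& data)
{
  // ... write `data` to the socket ...
  send();       // ... _send -> send: one more stack frame per message.
}

// An iterative drain bounds the stack depth regardless of queue length.
void sendAll()
{
  while (!outgoing.empty()) {
    std::string next = outgoing.front();
    outgoing.pop();
    // ... write `next` to the socket ...
  }
}
{code}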



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8851) Introduce a push-based gauge.

2018-04-26 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-8851:
--

 Summary: Introduce a push-based gauge.
 Key: MESOS-8851
 URL: https://issues.apache.org/jira/browse/MESOS-8851
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


Currently, we only have pull-based gauges which have significant performance 
downsides.

A push-based gauge differs from a pull-based gauge in that the client is 
responsible for pushing the latest value into the gauge whenever it changes. 
This can be challenging in some cases as it requires the client to have a good 
handle on when the gauge value changes (rather than just computing the current 
value when asked).

It is highly recommended to use push-based gauges where possible, as they 
provide significant performance benefits over pull-based gauges. Pull-based 
gauges suffer from delays while they are processed on the event queue of a 
Process, and they incur computation cost on that Process each time the metrics 
are collected. Push-based gauges, on the other hand, incur no cost to the 
owning Process when metrics are collected, and instead incur a trivial cost 
when the Process pushes new values in.
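For illustration, a push-based gauge can be as simple as an atomically stored value that the owning Process updates in place; the class below is a minimal sketch and not the actual libprocess metrics API:

{code}
// Minimal sketch of a push-based gauge -- not the actual libprocess API.
#include <atomic>

class PushGauge
{
public:
  explicit PushGauge(double initial = 0.0) : data(initial) {}

  // The owning Process pushes a new value whenever the quantity it tracks
  // changes; this is a cheap atomic store.
  void set(double value) { data.store(value, std::memory_order_relaxed); }

  // Metrics collection simply reads the stored value -- no dispatch onto the
  // owning Process's event queue and no recomputation are needed.
  double value() const { return data.load(std::memory_order_relaxed); }

private:
  std::atomic<double> data;
};
{code}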



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8257) Unified Containerizer "leaks" a target container mount path to the host FS when the target resolves to an absolute path

2018-04-26 Thread Jason Lai (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454867#comment-16454867
 ] 

Jason Lai commented on MESOS-8257:
--

[~alexr]: so far we have the following patches in review:
* https://reviews.apache.org/r/65811/
* https://reviews.apache.org/r/65812/
* https://reviews.apache.org/r/65898/
* https://reviews.apache.org/r/65899/
* https://reviews.apache.org/r/65900/

I'll have more patches coming up soon.

> Unified Containerizer "leaks" a target container mount path to the host FS 
> when the target resolves to an absolute path
> ---
>
> Key: MESOS-8257
> URL: https://issues.apache.org/jira/browse/MESOS-8257
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Jason Lai
>Assignee: Jason Lai
>Priority: Critical
>  Labels: bug, containerizer, mountpath
>
> If a target path under the root FS provisioned from an image resolves to an 
> absolute path, it will not appear in the container root FS after 
> {{pivot_root(2)}} is called.
> A typical example is when the target path is under {{/var/run}} (e.g. 
> {{/var/run/some-dir}}), which is usually a symlink to the absolute path 
> {{/run}} in Debian images: the target path gets resolved to and created at 
> {{/run/some-dir}} in the host root FS after the container root FS is 
> provisioned. The target path is then unmounted after {{pivot_root(2)}}, as it 
> is part of the old root (host FS).
> A workaround is to use {{/run}} instead of {{/var/run}}, but absolute 
> symlinks need to be resolved within the scope of the container root FS path.
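As a sketch of that last point (the helper name here is hypothetical, not the containerizer's actual code), an absolute symlink target has to be re-rooted under the container root FS before the mount target is created:

{code}
// Illustrative only: re-root an absolute symlink target under the container
// root filesystem, so that `/var/run -> /run` resolves to `<rootfs>/run`
// instead of the host's `/run`.
#include <string>

std::string resolveInRootfs(
    const std::string& rootfs,      // e.g. ".../provisioner/.../rootfs"
    const std::string& linkTarget)  // target read from the symlink, e.g. "/run"
{
  if (!linkTarget.empty() && linkTarget[0] == '/') {
    // Absolute target: interpret it relative to the container rootfs.
    return rootfs + linkTarget;
  }

  // Relative targets already resolve within the rootfs subtree.
  return linkTarget;
}
{code}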



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-04-26 Thread Chun-Hung Hsiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454853#comment-16454853
 ] 

Chun-Hung Hsiao commented on MESOS-8830:


How do you restart the agent as a new one? Did you just remove the {{latest}} 
symlink in the meta dir, or did you remove the runtime dir as well?

When an agent is restarted as a new one, it goes through the runtime dir to 
discover existing containers and checks whether there is a matching record in 
its checkpoint in the meta dir. In your case, since the agent is a new one, 
there will be no record at all, so all running containers discovered in the 
runtime dir will be considered orphans, and the containerizer will destroy them 
and clean them up, which includes running the cleanup of each isolator.

I suspect that for some reason the containers that appear in the log were still 
running and were not treated as orphaned containers. Could you verify whether 
this is the case? You could look at the agent log and check whether they were 
cleaned up as orphaned containers during recovery.

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which do not use any 
> container image and thus run on the host filesystem) saw their persistent 
> volume data get wiped out.
> Upon revisiting the logs, we found the following suspicious lines:
> {panel:title=log}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/

[jira] [Assigned] (MESOS-8849) Per Framework resource allocation metrics

2018-04-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-8849:


Assignee: Greg Mann

> Per Framework resource allocation metrics
> -
>
> Key: MESOS-8849
> URL: https://issues.apache.org/jira/browse/MESOS-8849
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>
> Expose these allocation-related metrics (e.g., # cpus allocated or offered, 
> allocation position, # times resources were filtered, etc.) on a per-framework 
> basis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8847) Per Framework task state metrics

2018-04-26 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-8847:


Assignee: Greg Mann

> Per Framework task state metrics
> 
>
> Key: MESOS-8847
> URL: https://issues.apache.org/jira/browse/MESOS-8847
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>
> Gauge metrics about the current number of tasks in active states (RUNNING, 
> STAGING, etc.).
>  
> Counter metrics about the number of tasks that reached terminal states 
> (FINISHED, FAILED, etc.).
> These counter metrics will have the granularity of task states and reasons 
> (e.g., the number of tasks that are FINISHED due to REASON `foo` from SOURCE 
> `master`).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8850) Race between master and allocator when destroying shared volume could lead to sorter check failure.

2018-04-26 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-8850:
---

 Summary: Race between master and allocator when destroying shared 
volume could lead to sorter check failure.
 Key: MESOS-8850
 URL: https://issues.apache.org/jira/browse/MESOS-8850
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Reporter: Meng Zhu


When destroying a shared volume, the master first rescinds offers that contain 
the shared volume and then applies the destroy operation. This process involves 
interaction between the master and the allocator actor. The following race can 
arise:

1. Framework1 and framework2 are each offered a shared disk;
2. Framework2 asks the master to destroy the shared disk;
3. The master rescinds framework1's offer that contains the shared disk;
4. `allocator->recoverResources` is called to recover framework1's offered 
resources in the allocator;
5. [Race] Shortly afterwards, the allocator allocates resources to framework1. 
The allocation contains the shared disk that was just recovered and has not yet 
been destroyed. The allocator invokes `offerCallback`, which dispatches to the 
master;
6. The master continues the destroy operation and calls 
`allocator->updateAllocation` to tell the allocator to transform the shared 
disk into a regular reserved disk;
7. The master processes the `offerCallback` dispatched in step 5 and offers the 
shared disk to framework1.

At this point, the same disk resource appears in two different places: once as 
shared, offered to framework1, and once as non-shared, currently held by 
framework2 (soon to be recovered).

One consequence:
Framework2's resources get recovered, which includes the (now regular reserved) 
disk resource.
Later, when recovering framework1's resources, which contain the shared disk, 
the sorter finds that the allocated resources on the agent do not contain that 
shared disk (because in step 5, when offering the shared disk, the allocator 
did not increase the total allocated resources, since framework2 was also 
holding the shared disk; a shared resource is added to the allocated total only 
when it is allocated for the first time).

This leads to a check failure in the sorter:
https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L480

Moving offer management to the allocator could definitely eliminate this race. 
Without that, we will need to add extra synchronization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8849) Per Framework resource allocation metrics

2018-04-26 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-8849:
-

 Summary: Per Framework resource allocation metrics
 Key: MESOS-8849
 URL: https://issues.apache.org/jira/browse/MESOS-8849
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


Expose these allocation-related metrics (e.g., # cpus allocated or offered, 
allocation position, # times resources were filtered, etc.) on a per-framework 
basis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8842) Per Framework Metrics on Master

2018-04-26 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454621#comment-16454621
 ] 

Vinod Kone commented on MESOS-8842:
---

Doc describing the structure and types of metrics that will be added.

 

https://docs.google.com/document/d/14aDm85SKMCX6RMJs0o1hRKhU2rABr4mnHIMNKfNdzuk/edit#

> Per Framework Metrics on Master
> ---
>
> Key: MESOS-8842
> URL: https://issues.apache.org/jira/browse/MESOS-8842
> Project: Mesos
>  Issue Type: Epic
>  Components: master
>Reporter: Vinod Kone
>Priority: Critical
>
> Currently, the metrics exposed by the Mesos master are cluster-wide metrics. 
> It would be great to have some metrics on a per-framework basis to help with 
> scalability testing, debugging, fine-grained monitoring, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8848) Per Framework Offer metrics

2018-04-26 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-8848:
-

 Summary: Per Framework Offer metrics
 Key: MESOS-8848
 URL: https://issues.apache.org/jira/browse/MESOS-8848
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


Metrics regarding the number of offers (sent, accepted, declined, rescinded) on 
a per-framework basis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8847) Per Framework task state metrics

2018-04-26 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-8847:
-

 Summary: Per Framework task state metrics
 Key: MESOS-8847
 URL: https://issues.apache.org/jira/browse/MESOS-8847
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


Gauge metrics about the current number of tasks in active states (RUNNING, 
STAGING, etc.).

Counter metrics about the number of tasks that reached terminal states 
(FINISHED, FAILED, etc.).

These counter metrics will have the granularity of task states and reasons 
(e.g., the number of tasks that are FINISHED due to REASON `foo` from SOURCE 
`master`).
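Purely for illustration (these keys are hypothetical; the actual naming will come out of the design doc), such per-framework counters might be keyed along the lines of:

{noformat}
master/frameworks/<framework>/tasks/terminal/task_finished: 10
master/frameworks/<framework>/tasks/terminal/task_failed/reason_foo/source_master: 2
{noformat}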



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8846) Per Framework state metrics

2018-04-26 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-8846:
-

 Summary: Per Framework state metrics
 Key: MESOS-8846
 URL: https://issues.apache.org/jira/browse/MESOS-8846
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


Metrics about framework state (e.g., subscribed, suppressed, etc.).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8845) Per Framework Operation metrics

2018-04-26 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-8845:
-

 Summary: Per Framework Operation metrics
 Key: MESOS-8845
 URL: https://issues.apache.org/jira/browse/MESOS-8845
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


Metrics for the number of operations sent via ACCEPT calls by a framework.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8844) Per Framework EVENT metrics

2018-04-26 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-8844:
-

 Summary: Per Framework EVENT metrics
 Key: MESOS-8844
 URL: https://issues.apache.org/jira/browse/MESOS-8844
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


Metrics for the number of events sent by the master to a framework.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8843) Per Framework CALL metrics

2018-04-26 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-8843:
-

 Summary: Per Framework CALL metrics
 Key: MESOS-8843
 URL: https://issues.apache.org/jira/browse/MESOS-8843
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


Metrics about the number of different kinds of calls sent by a framework to the 
master.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8842) Per Framework Metrics on Master

2018-04-26 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-8842:
-

 Summary: Per Framework Metrics on Master
 Key: MESOS-8842
 URL: https://issues.apache.org/jira/browse/MESOS-8842
 Project: Mesos
  Issue Type: Epic
  Components: master
Reporter: Vinod Kone


Currently, the metrics exposed by the Mesos master are cluster-wide metrics. It 
would be great to have some metrics on a per-framework basis to help with 
scalability testing, debugging, fine-grained monitoring, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8734) Restore `WaitAfterDestroy` test to check termination status of a terminated nested container.

2018-04-26 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8734:


Assignee: Andrei Budnik

> Restore `WaitAfterDestroy` test to check termination status of a terminated 
> nested container.
> -
>
> Key: MESOS-8734
> URL: https://issues.apache.org/jira/browse/MESOS-8734
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere, test
>
> It's important to check that after the termination of a nested container, its 
> termination status is available. This property is used in the default executor.
> Note that the test uses the Mesos containerizer and checks the above-mentioned 
> property only for the Mesos containerizer.
> Right now, if we remove [this section of 
> code|https://github.com/apache/mesos/blob/5b655ce062ff55cdefed119d97ad923aeeb2efb5/src/slave/containerizer/mesos/containerizer.cpp#L2093-L2111],
>  no test will break!
> https://reviews.apache.org/r/65505



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8687) Check failure in `ProcessBase::_consume()`.

2018-04-26 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454395#comment-16454395
 ] 

Benno Evers commented on MESOS-8687:


Review for the test fix: https://reviews.apache.org/r/66799/

> Check failure in `ProcessBase::_consume()`.
> ---
>
> Key: MESOS-8687
> URL: https://issues.apache.org/jira/browse/MESOS-8687
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.6.0
> Environment: ec2 CentOS 7 with SSL
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test, reliability
> Attachments: MasterAPITest.MasterFailover-with-CHECK.txt, 
> MasterFailover-badrun.txt
>
>
> Observed a segfault in the {{MasterAPITest.MasterFailover}} test:
> {noformat}
> 10:59:04 I0319 10:59:04.312197  3274 master.cpp:649] Authorization enabled
> 10:59:04 F0319 10:59:04.312772  3274 owned.hpp:110] Check failed: 'get()' 
> Must be non NULL
> 10:59:04 *** Check failure stack trace: ***
> 10:59:04 I0319 10:59:04.313470  3279 hierarchical.cpp:175] Initialized 
> hierarchical allocator process
> 10:59:04 I0319 10:59:04.313500  3279 whitelist_watcher.cpp:77] No whitelist 
> given
> 10:59:04 @ 0x7fe82d44e0cd  google::LogMessage::Fail()
> 10:59:04 @ 0x7fe82d44ff1d  google::LogMessage::SendToLog()
> 10:59:04 @ 0x7fe82d44dcb3  google::LogMessage::Flush()
> 10:59:04 @ 0x7fe82d450919  google::LogMessageFatal::~LogMessageFatal()
> 10:59:04 @ 0x7fe82d3cee16  google::CheckNotNull<>()
> 10:59:04 @ 0x7fe82d3b4253  process::ProcessBase::_consume()
> 10:59:04 @ 0x7fe82d3b4a66  
> _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZNS1_11ProcessBase7consumeEONS1_9HttpEventEEUlRKNS1_5OwnedINS3_7Request_JSG_clEv
> 10:59:04 @ 0x7fe82c39c3ca  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_JST_SI_St12_PlaceholderILi1EEclEOS3_
> 10:59:04 @ 0x7fe82d39f2c1  process::ProcessBase::consume()
> 10:59:04 @ 0x7fe82d3b84da  process::ProcessManager::resume()
> 10:59:04 @ 0x7fe82d3bbf56  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 10:59:04 @ 0x7fe82d577870  execute_native_thread_routine
> 10:59:04 @ 0x7fe82a761e25  start_thread
> 10:59:04 @ 0x7fe82986334d  __clone
> {noformat}
> Full test log is attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8797) Check failed in the default executor while running `MesosContainerizer/DefaultExecutorTest.TaskUsesExecutor/0` test.

2018-04-26 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454390#comment-16454390
 ] 

Benno Evers commented on MESOS-8797:


https://reviews.apache.org/r/66815/

> Check failed in the default executor while running 
> `MesosContainerizer/DefaultExecutorTest.TaskUsesExecutor/0` test.
> 
>
> Key: MESOS-8797
> URL: https://issues.apache.org/jira/browse/MESOS-8797
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
> Environment: Centos 7 SSL (internal CI)
> master-[a95d9b8|https://github.com/apache/mesos/commit/a95d9b8fb53ab8fbf4a7b6d762c9e0749b4c013a]
>  (17-Apr-2018 14:03:14)
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: flaky, flaky-test
> Attachments: DefaultExecutorTest.TaskUsesExecutor-badrun.txt
>
>
> {code:java}
> lt-mesos-default-executor: ../../3rdparty/stout/include/stout/option.hpp:119: 
> T& Option<T>::get() & [with T = std::basic_string<char>]: Assertion 
> `isSome()' failed.
> *** Aborted at 1523976443 (unix time) try "date -d @1523976443" if you are 
> using GNU date ***
> PC: @ 0x7efcfd11f1f7 __GI_raise
> *** SIGABRT (@0x4d44) received by PID 19780 (TID 0x7efcf5adb700) from PID 
> 19780; stack trace: ***
> @ 0x7efcfd9da5e0 (unknown)
> @ 0x7efcfd11f1f7 __GI_raise
> @ 0x7efcfd1208e8 __GI_abort
> @ 0x7efcfd118266 __assert_fail_base
> @ 0x7efcfd118312 __GI___assert_fail
> @ 0x55a05fa269f7 mesos::internal::DefaultExecutor::waited()
> @ 0x7efd002212d1 process::ProcessBase::consume()
> @ 0x7efd0023a52a process::ProcessManager::resume()
> @ 0x7efd0023dfa6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7efd003f9470 execute_native_thread_routine
> @ 0x7efcfd9d2e25 start_thread
> @ 0x7efcfd1e234d __clone
> {code}
> Observed this failure in internal CI for test
> {code:java}
>  MesosContainerizer/DefaultExecutorTest.TaskUsesExecutor/0{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8809) Add functions for manipulating POSIX ACLs into stout

2018-04-26 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454262#comment-16454262
 ] 

Qian Zhang commented on MESOS-8809:
---

RR: https://reviews.apache.org/r/66811/

> Add functions for manipulating POSIX ACLs into stout
> 
>
> Key: MESOS-8809
> URL: https://issues.apache.org/jira/browse/MESOS-8809
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> We need to add functions for setting/getting POSIX ACLs into stout so that we 
> can leverage these functions to grant volume permissions to the specific task 
> user.
> This will introduce a new dependency {{libacl-devel}} when building Mesos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8834) libprocess's internal::send and internal::_send call each other; when outgoing[socket] always has packets to send, the stack is exhausted and a core dump occurs

2018-04-26 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454219#comment-16454219
 ] 

Benno Evers commented on MESOS-8834:


While I can't really understand the text, judging from the send -> _send -> 
send -> ... -> core dump sequence, this looks like it might be the same issue 
as MESOS-8594?

> libprocess's internal::send and internal::_send call each other; when 
> outgoing[socket] always has packets to send, the stack is exhausted and a 
> core dump occurs
> 
>
> Key: MESOS-8834
> URL: https://issues.apache.org/jira/browse/MESOS-8834
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.5.0
>Reporter: liwuqi
>Priority: Blocker
>  Labels: core, libprocess, send
>
> If a process sends messages in a while(true) loop, a large number of messages 
> will be buffered in outgoing[socket], while internal::send and internal::_send 
> perform the actual sending at the lower layer. This results in a recursive 
> call chain:
> _send -> send -> _send -> send -> ... -> _send -> send -> ...
> The call stack keeps growing until it is exhausted and a core dump occurs.
> In my local tests, the core dump occurred once the stack depth reached 40,000+.
> To fix this, the underlying message-sending mechanism needs to be changed.
>  
> Please pay attention to this issue, thank you.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8841) Flaky `MasterAllocatorTest/0.SingleFramework`

2018-04-26 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8841:


 Summary: Flaky `MasterAllocatorTest/0.SingleFramework`
 Key: MESOS-8841
 URL: https://issues.apache.org/jira/browse/MESOS-8841
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
 Environment: Fedora 25
master/a1c6a7a3c5
Reporter: Andrei Budnik


 
{code:java}
[ RUN ] MasterAllocatorTest/0.SingleFramework
F0426 08:31:29.775804 9701 hierarchical.cpp:586] Check failed: 
slaves.contains(slaveId)
*** Check failure stack trace: ***
@ 0x7f365e108fb8 google::LogMessage::Fail()
@ 0x7f365e108f15 google::LogMessage::SendToLog()
@ 0x7f365e10890f google::LogMessage::Flush()
@ 0x7f365e10b6d2 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f365c63b8d7 
mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
@ 0x55728a500ac7 
_ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDES8_EEvRKNS_3PIDIT_EEMSA_FvT0_EOT1_ENKUlOS6_PNS_11ProcessBaseEE_clESJ_SL_
@ 0x55728a589908 
_ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS3_7SlaveIDESA_EEvRKNS1_3PIDIT_EEMSC_FvT0_EOT1_EUlOS8_PNS1_11ProcessBaseEE_JS8_SN_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSP_
@ 0x55728a586a0f 
_ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_7SlaveIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi113invoke_expandISP_St5tupleIJS9_SR_EESU_IJOSO_EEJLm0ELm1DTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISH_Efp0_EEcl7forwardISK_Efp2_OSD_OSH_N5cpp1416integer_sequenceImJXspT2_SL_
@ 0x55728a5852b0 
_ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_7SlaveIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi1clIJSO_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOSX_
@ 0x55728a584209 
_ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS6_7SlaveIDESD_EEvRKNS4_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS4_11ProcessBaseEE_JSB_St12_PlaceholderILi1EJSQ_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOSV_
@ 0x55728a583995 
_ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS7_7SlaveIDESE_EEvRKNS5_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS5_11ProcessBaseEE_JSC_St12_PlaceholderILi1EJSR_EEEvOSG_DpOT0_
@ 0x55728a581522 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSA_7SlaveIDESH_EEvRKNS1_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_S3_E_JSF_St12_PlaceholderILi1EEclEOS3_
@ 0x7f365e0484c0 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
@ 0x7f365e025760 process::ProcessBase::consume()
@ 0x7f365e033abc _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
@ 0x55728a1cb6ea process::ProcessBase::serve()
@ 0x7f365e0225ed process::ProcessManager::resume()
@ 0x7f365e01e94c _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
@ 0x7f365e031080 
_ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
@ 0x7f365e030a34 
_ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEclEv
@ 0x7f365e030338 
_ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
@ 0x7f365478976f (unknown)
@ 0x7f3654e6973a start_thread
@ 0x7f3653eefe7f __GI___clone{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7944) Implement jemalloc memory profiling support for Mesos

2018-04-26 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453831#comment-16453831
 ] 

Alexander Rukletsov commented on MESOS-7944:


{noformat}
commit aa65947286d9115d1bdd34d7b7f0f0038e128345
Author: Benno Evers bev...@mesosphere.com
AuthorDate: Thu Apr 26 12:01:26 2018 +0200
Commit: Alexander Rukletsov al...@apache.org
CommitDate: Thu Apr 26 12:45:02 2018 +0200

Added documentation for memory profiling.

Review: https://reviews.apache.org/r/63372/
{noformat}

> Implement jemalloc memory profiling support for Mesos
> -
>
> Key: MESOS-7944
> URL: https://issues.apache.org/jira/browse/MESOS-7944
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.6.0
>
>
> After investigation in MESOS-7876 and discussion on the mailing list, this 
> task is for tracking progress on adding out-of-the-box memory profiling 
> support using jemalloc to Mesos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7854) Authorize resource calls to provider manager api

2018-04-26 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453760#comment-16453760
 ] 

Jan Schlicht commented on MESOS-7854:
-

Closing this in favor of MESOS-8774, as that ticket is more specific.

> Authorize resource calls to provider manager api
> 
>
> Key: MESOS-7854
> URL: https://issues.apache.org/jira/browse/MESOS-7854
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Priority: Critical
>  Labels: csi-post-mvp, mesosphere, storage
>
> The resource provider manager provides a function
> {code}
> process::Future<process::http::Response> api(
> const process::http::Request& request,
> const Option<process::http::authentication::Principal>& principal) const;
> {code}
> which is exposed, e.g., as an agent endpoint.
> We need to add authorization to this function in order to, e.g., stop rogue 
> callers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8774) Authenticate and authorize calls to the resource provider manager's API

2018-04-26 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-8774:
---

Assignee: Jan Schlicht

> Authenticate and authorize calls to the resource provider manager's API 
> 
>
> Key: MESOS-8774
> URL: https://issues.apache.org/jira/browse/MESOS-8774
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Benjamin Bannier
>Assignee: Jan Schlicht
>Priority: Major
>  Labels: mesosphere
>
> The resource provider manager is exposed via an agent endpoint against which 
> resource providers subscribe or perform other actions. We should authenticate 
> and authorize any interactions there.
> Since local resource providers currently run on agents, which manage their 
> lifetime, it seems natural to extend the framework used for executor 
> authentication to resource providers as well. The agent would then generate a 
> secret token whenever a new resource provider is started and inject it into 
> the resource providers it launches. Resource providers in turn would use this 
> token when interacting with the manager API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)