[jira] [Comment Edited] (MESOS-8809) Add functions for manipulating POSIX ACLs into stout
[ https://issues.apache.org/jira/browse/MESOS-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454262#comment-16454262 ] Qian Zhang edited comment on MESOS-8809 at 4/27/18 1:36 AM: RR: https://reviews.apache.org/r/66840/ was (Author: qianzhang): RR: https://reviews.apache.org/r/66811/ > Add functions for manipulating POSIX ACLs into stout > > > Key: MESOS-8809 > URL: https://issues.apache.org/jira/browse/MESOS-8809 > Project: Mesos > Issue Type: Task > Components: stout >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > We need to add functions for setting/getting POSIX ACLs into stout so that we > can leverage these functions to grant volume permissions to the specific task > user. > This will introduce a new dependency {{libacl-devel}} when building Mesos. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
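As context for what such stout helpers would wrap, below is a minimal sketch of getting and setting a file's access ACL with libacl. Only the acl_* calls come from the libacl API; the helper names and the plain error handling are illustrative assumptions, not the functions actually added in the review above.
{code}
// Illustrative sketch of the libacl calls involved; the helper names here
// are hypothetical and are not the functions added to stout.
// Build with: g++ acl_example.cpp -lacl
#include <sys/acl.h>

#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

// Return the access ACL of `path` in text form, or an empty string on error.
std::string getAccessAcl(const std::string& path) {
  acl_t acl = acl_get_file(path.c_str(), ACL_TYPE_ACCESS);
  if (acl == nullptr) {
    std::cerr << "acl_get_file failed: " << strerror(errno) << std::endl;
    return "";
  }

  char* text = acl_to_text(acl, nullptr);
  std::string result = (text != nullptr) ? text : "";

  if (text != nullptr) {
    acl_free(text);
  }
  acl_free(acl);
  return result;
}

// Apply an ACL given in text form, e.g. granting uid 1000 rwx:
// "u::rwx,g::r-x,o::---,u:1000:rwx,m::rwx".
bool setAccessAcl(const std::string& path, const std::string& aclText) {
  acl_t acl = acl_from_text(aclText.c_str());
  if (acl == nullptr) {
    std::cerr << "acl_from_text failed: " << strerror(errno) << std::endl;
    return false;
  }

  int result = acl_set_file(path.c_str(), ACL_TYPE_ACCESS, acl);
  acl_free(acl);
  return result == 0;
}

int main() {
  std::cout << getAccessAcl("/tmp") << std::endl;
  return 0;
}
{code}
In the ticket's use case, an entry of the form u:<task-user-uid>:rwx is the kind of ACL that would grant the task user access to a volume.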
[jira] [Commented] (MESOS-8834) In libprocess, internal::send and internal::_send call each other; when outgoing[socket] always has packets to send, the stack can be exhausted, causing a core dump
[ https://issues.apache.org/jira/browse/MESOS-8834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455602#comment-16455602 ] Qian Zhang commented on MESOS-8834: --- [~bennoe] You are right, it is the same as MESOS-8594, so I have marked this one as a duplicate. [~general] Thanks for creating this ticket, please use English in JIRA so that others can better understand the issue :) > In libprocess, internal::send and internal::_send call each other; > when outgoing[socket] always has packets to send, the stack can be exhausted, causing a core dump > > > Key: MESOS-8834 > URL: https://issues.apache.org/jira/browse/MESOS-8834 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.5.0 >Reporter: liwuqi >Priority: Blocker > Labels: core, libprocess, send > > If a process sends messages in a while(true) loop, a large number of > messages will be buffered in outgoing[socket]. Since the sending is carried out underneath by internal::send and internal::_send, a recursive call chain arises: > _send -> send -> _send -> send -> ... -> _send -> send -> ... > The call stack keeps growing until it is exhausted and a core dump occurs. > In my local tests, the core dump happened once the stack depth reached 40,000+. > To solve this problem, the underlying message sending mechanism needs to be changed. > > Please pay attention to this issue, thanks. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
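The failure mode the reporter describes is unbounded mutual recursion: each queued message adds another send/_send frame pair to the stack. A minimal standalone sketch (illustrative only, not the libprocess implementation; the names merely mirror the report) contrasts that pattern with an iterative drain loop that uses constant stack space:
{code}
// Illustrative sketch only; not the libprocess implementation. The names
// send/_send and the `outgoing` queue merely mirror the report above.
#include <iostream>
#include <queue>
#include <string>

std::queue<std::string> outgoing;  // Stand-in for outgoing[socket].

void _send();

// Recursive style: one send/_send frame pair per queued message, so a
// large backlog exhausts the stack -- the core dump described here.
void send() {
  if (!outgoing.empty()) {
    _send();
  }
}

void _send() {
  outgoing.pop();  // "Write" one message...
  send();          // ...then recurse to send the next one.
}

// Iterative style: drains the same queue in constant stack space.
void sendAll() {
  while (!outgoing.empty()) {
    outgoing.pop();  // "Write" one message per iteration.
  }
}

int main() {
  for (int i = 0; i < 100000; ++i) {
    outgoing.push("message");
  }

  sendAll();  // Safe; calling send() instead may overflow the stack.
  std::cout << "queue drained, size = " << outgoing.size() << std::endl;
  return 0;
}
{code}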
[jira] [Created] (MESOS-8851) Introduce a push-based gauge.
Benjamin Mahler created MESOS-8851: -- Summary: Introduce a push-based gauge. Key: MESOS-8851 URL: https://issues.apache.org/jira/browse/MESOS-8851 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Benjamin Mahler Assignee: Benjamin Mahler Currently, we only have pull-based gauges, which have significant performance downsides. A push-based gauge differs from a pull-based gauge in that the client is responsible for pushing the latest value into the gauge whenever it changes. This can be challenging in some cases as it requires the client to have a good handle on when the gauge value changes (rather than just computing the current value when asked). It is highly recommended to use push-based gauges if possible as they provide significant performance benefits over pull-based gauges. Pull-based gauges suffer from delays getting processed on the event queue of a Process, as well as incurring computation cost on the Process each time the metrics are collected. Push-based gauges, on the other hand, incur no cost to the owning Process when metrics are collected, and instead incur a trivial cost when the Process pushes new values in. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
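A rough sketch of the difference (the PushGauge/PullGauge classes below are hypothetical stand-ins, not the libprocess metrics API): a push-based gauge is just a value the owner updates in place, so a metrics snapshot is a cheap read, while a pull-based gauge has to invoke a callback, which in libprocess means dispatching work onto the owning Process every time metrics are collected.
{code}
// Hypothetical sketch; PushGauge/PullGauge are illustrative stand-ins and
// do not reflect the actual libprocess metrics API.
#include <atomic>
#include <functional>
#include <iostream>
#include <utility>

// Push-based: the owner stores the latest value whenever it changes;
// collecting the metric is a cheap atomic load.
class PushGauge {
public:
  void set(double v) { value.store(v, std::memory_order_relaxed); }
  double read() const { return value.load(std::memory_order_relaxed); }

private:
  std::atomic<double> value{0.0};
};

// Pull-based: every read invokes a callback to compute the value, which in
// libprocess translates to a dispatch onto the owning Process's event queue.
class PullGauge {
public:
  explicit PullGauge(std::function<double()> f) : compute(std::move(f)) {}
  double read() const { return compute(); }  // Cost paid on every collection.

private:
  std::function<double()> compute;
};

int main() {
  PushGauge tasksRunning;
  tasksRunning.set(42);          // The owner pushes on every state change.

  PullGauge queueSize([]() {     // The collector pulls; work happens here.
    return 7.0;                  // Stand-in for a real computation.
  });

  std::cout << tasksRunning.read() << " " << queueSize.read() << std::endl;
  return 0;
}
{code}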
[jira] [Commented] (MESOS-8257) Unified Containerizer "leaks" a target container mount path to the host FS when the target resolves to an absolute path
[ https://issues.apache.org/jira/browse/MESOS-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454867#comment-16454867 ] Jason Lai commented on MESOS-8257: -- [~alexr]: so far we have the following patches in review: * https://reviews.apache.org/r/65811/ * https://reviews.apache.org/r/65812/ * https://reviews.apache.org/r/65898/ * https://reviews.apache.org/r/65899/ * https://reviews.apache.org/r/65900/ I'll have more patches coming up soon > Unified Containerizer "leaks" a target container mount path to the host FS > when the target resolves to an absolute path > --- > > Key: MESOS-8257 > URL: https://issues.apache.org/jira/browse/MESOS-8257 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Jason Lai >Assignee: Jason Lai >Priority: Critical > Labels: bug, containerizer, mountpath > > If a target path under the root FS provisioned from an image resolves to an > absolute path, it will not appear in the container root FS after > {{pivot_root(2)}} is called. > A typical example is that when the target path is under {{/var/run}} (e.g. > {{/var/run/some-dir}}), which is usually a symlink to an absolute path of > {{/run}} in Debian images, the target path will get resolved as and created > at {{/run/some-dir}} in the host root FS, after the container root FS gets > provisioned. The target path will get unmounted after {{pivot_root(2)}} as it > is part of the old root (host FS). > A workaround is to use {{/run}} instead of {{/var/run}}, but absolute > symlinks need to be resolved within the scope of the container root FS path. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
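To make the failure mode concrete, here is a small sketch (not the Mesos containerizer code; the function and the rootfs path are hypothetical) of the workaround the description hints at: an absolute symlink target read from inside the provisioned root FS has to be re-rooted under that root FS, otherwise later mkdir/mount calls land on the host filesystem.
{code}
// Illustrative sketch only; not the Mesos containerizer implementation.
#include <iostream>
#include <string>

// Re-root an absolute symlink target under the container root FS. Without
// this, a target such as "/run" (what /var/run points to in Debian images)
// resolves to the host's /run and the mount "leaks" out of the container.
std::string scopeToRootfs(const std::string& rootfs, const std::string& target) {
  if (!target.empty() && target[0] == '/') {
    return rootfs + target;  // e.g. "<rootfs>/run"
  }
  return target;             // Relative targets already stay inside the rootfs.
}

int main() {
  // Hypothetical provisioned root FS path.
  const std::string rootfs = "/var/lib/mesos/provisioner/containers/abc/rootfs";

  // In a Debian image /var/run is typically a symlink to "/run", so a task
  // asking for /var/run/some-dir should end up under the rootfs, not the host.
  std::cout << scopeToRootfs(rootfs, "/run") + "/some-dir" << std::endl;
  return 0;
}
{code}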
[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data
[ https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454853#comment-16454853 ] Chun-Hung Hsiao commented on MESOS-8830: How do you restart the agent as a new one? Did you just remove the {{latest}} symlink in the meta dir, or did you remove the runtime dir as well? When an agent is restarted as a new one, it goes through the runtime dir to discover existing containers, and checks whether there is a matching record in its checkpoint in the meta dir. In your case, since the agent is a new one, there will be no record at all, so all running containers discovered in the runtime dir will be considered orphans, and the containerizer will destroy them and clean them up, which includes running cleanup for each isolator. I suspect that for some reason the containers that appear in the log were still running and were not treated as orphaned containers. Could you verify whether this is the case? You could look at the agent log and see if they have been cleaned up as orphaned containers during recovery. > Agent gc on old slave sandboxes could empty persistent volume data > -- > > Key: MESOS-8830 > URL: https://issues.apache.org/jira/browse/MESOS-8830 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.3.1 >Reporter: Zhitao Li >Priority: Blocker > > We had an issue in which custom Cassandra executors (which do not use any > container image and thus run on the host filesystem) saw their persistent volume > data get wiped out. > Upon revisiting the logs, we found the following suspicious lines: > {panel:title=log} > I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max > allowed age: 4.764742265646493days > I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining > removal time 2.23508429704593days > I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining > removal time 2.23508429587852days > I0424 02:06:11.717183 10994 gc.cpp:133] Deleting > /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44 > I0424 02:06:11.727033 10994 gc.cpp:146] Deleted > '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44' > I0424 02:06:11.727094 10994 gc.cpp:133] Deleting > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44 > I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from > 127.0.0.1:53602 with User-Agent='Go-http-client/1.1' > E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume: > Device or resource busy > E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149: > Directory not empty > E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs: > Directory not empty > E0424 
02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a: > Directory not empty > E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors: > Directory not empty > E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004: > Directory not empty > E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks: > Directory not empty > E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory > /var/lib/mesos/slaves/
[jira] [Assigned] (MESOS-8849) Per Framework resource allocation metrics
[ https://issues.apache.org/jira/browse/MESOS-8849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-8849: Assignee: Greg Mann > Per Framework resource allocation metrics > - > > Key: MESOS-8849 > URL: https://issues.apache.org/jira/browse/MESOS-8849 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone >Assignee: Greg Mann >Priority: Major > > Allocation-related metrics (e.g., # cpus allocated or offered, > allocation position, # times resources were filtered, etc.) on a per-framework > basis. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8847) Per Framework task state metrics
[ https://issues.apache.org/jira/browse/MESOS-8847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-8847: Assignee: Greg Mann > Per Framework task state metrics > > > Key: MESOS-8847 > URL: https://issues.apache.org/jira/browse/MESOS-8847 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone >Assignee: Greg Mann >Priority: Major > > Gauge metrics about the current number of tasks in active states (RUNNING, > STAGING, etc.). > > Counter metrics about the number of tasks that reached terminal states (FINISHED, > FAILED, etc.). > These counter metrics will have granularity of task states and reasons (i.e., > the number of tasks that are FINISHED due to REASON `foo` from SOURCE `master`). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8850) Race between master and allocator when destroying shared volume could lead to sorter check failure.
Meng Zhu created MESOS-8850: --- Summary: Race between master and allocator when destroying shared volume could lead to sorter check failure. Key: MESOS-8850 URL: https://issues.apache.org/jira/browse/MESOS-8850 Project: Mesos Issue Type: Bug Components: allocation, master Reporter: Meng Zhu When destroying a shared volume, the master first rescinds offers that contain the shared volume and then applies the destroy operation. This process involves interaction between the master and the allocator actor. The following race could arise: 1. Framework1 and framework2 are each offered a shared disk; 2. Framework2 asks the master to destroy the shared disk; 3. Master rescinds framework1's offer that contains the shared disk; 4. `allocator->recoverResources` is called to recover framework1's offered resources in the allocator; 5. [Race] The allocator shortly allocates resources to framework1. The allocation contains the shared disk that just got recovered, which has not been destroyed at this moment. The allocator invokes `offerCallback`, which dispatches to the master; 6. Master continues the destroy operation and calls `allocator->updateAllocation` to notify the allocator to transform the shared disk into a regular reserved disk; 7. Master processes the `offerCallback` dispatched in step 5 and offers the shared disk to framework1. At this point, the same disk resource appears in two different places: one shared, offered to framework1; one not shared, currently held by framework2 (soon to be recovered). One aftermath is that framework2's resources get recovered, which includes the (now regular reserved) disk resource. Later, when recovering framework1's resources, which contain the shared disk, the sorter finds that the allocated resources on the agent do not contain that shared disk (because in step 5, when offering the shared disk, the allocator did not increase the total allocated resources, as framework2 was also holding the shared disk; we only add a shared resource to the allocated total when it is allocated for the first time). This will lead to a check failure in the sorter: https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L480 Moving offer management to the allocator could definitely eliminate this race. Without that, we will need to add extra synchronization. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8849) Per Framework resource allocation metrics
Vinod Kone created MESOS-8849: - Summary: Per Framework resource allocation metrics Key: MESOS-8849 URL: https://issues.apache.org/jira/browse/MESOS-8849 Project: Mesos Issue Type: Task Reporter: Vinod Kone Allocation-related metrics (e.g., # cpus allocated or offered, allocation position, # times resources were filtered, etc.) on a per-framework basis. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8842) Per Framework Metrics on Master
[ https://issues.apache.org/jira/browse/MESOS-8842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454621#comment-16454621 ] Vinod Kone commented on MESOS-8842: --- Doc describing the structure and types of metrics that will be added. https://docs.google.com/document/d/14aDm85SKMCX6RMJs0o1hRKhU2rABr4mnHIMNKfNdzuk/edit# > Per Framework Metrics on Master > --- > > Key: MESOS-8842 > URL: https://issues.apache.org/jira/browse/MESOS-8842 > Project: Mesos > Issue Type: Epic > Components: master >Reporter: Vinod Kone >Priority: Critical > > Currently, the metrics exposed by the Mesos master are cluster wide metrics. > It would be great to have some metrics on a per framework basis to help with > scalability testing, debugging, fine grained monitoring etc. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8848) Per Framework Offer metrics
Vinod Kone created MESOS-8848: - Summary: Per Framework Offer metrics Key: MESOS-8848 URL: https://issues.apache.org/jira/browse/MESOS-8848 Project: Mesos Issue Type: Task Reporter: Vinod Kone Metrics regarding number of offers (sent, accepted, declined, rescinded) on a per framework basis. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8847) Per Framework task state metrics
Vinod Kone created MESOS-8847: - Summary: Per Framework task state metrics Key: MESOS-8847 URL: https://issues.apache.org/jira/browse/MESOS-8847 Project: Mesos Issue Type: Task Reporter: Vinod Kone Gauge metrics about the current number of tasks in active states (RUNNING, STAGING, etc.). Counter metrics about the number of tasks that reached terminal states (FINISHED, FAILED, etc.). These counter metrics will have granularity of task states and reasons (i.e., the number of tasks that are FINISHED due to REASON `foo` from SOURCE `master`). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8846) Per Framework state metrics
Vinod Kone created MESOS-8846: - Summary: Per Framework state metrics Key: MESOS-8846 URL: https://issues.apache.org/jira/browse/MESOS-8846 Project: Mesos Issue Type: Task Reporter: Vinod Kone Metrics about framework state (e.g., subscribed, suppressed etc). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8845) Per Framework Operation metrics
Vinod Kone created MESOS-8845: - Summary: Per Framework Operation metrics Key: MESOS-8845 URL: https://issues.apache.org/jira/browse/MESOS-8845 Project: Mesos Issue Type: Task Reporter: Vinod Kone Metrics for the number of operations sent via ACCEPT calls by a framework. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8844) Per Framework EVENT metrics
Vinod Kone created MESOS-8844: - Summary: Per Framework EVENT metrics Key: MESOS-8844 URL: https://issues.apache.org/jira/browse/MESOS-8844 Project: Mesos Issue Type: Task Reporter: Vinod Kone Metrics for number of events sent by the master to the framework. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8843) Per Framework CALL metrics
Vinod Kone created MESOS-8843: - Summary: Per Framework CALL metrics Key: MESOS-8843 URL: https://issues.apache.org/jira/browse/MESOS-8843 Project: Mesos Issue Type: Task Reporter: Vinod Kone Metrics about number of different kinds of calls sent by a framework to master. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8842) Per Framework Metrics on Master
Vinod Kone created MESOS-8842: - Summary: Per Framework Metrics on Master Key: MESOS-8842 URL: https://issues.apache.org/jira/browse/MESOS-8842 Project: Mesos Issue Type: Epic Components: master Reporter: Vinod Kone Currently, the metrics exposed by the Mesos master are cluster wide metrics. It would be great to have some metrics on a per framework basis to help with scalability testing, debugging, fine grained monitoring etc. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8734) Restore `WaitAfterDestroy` test to check termination status of a terminated nested container.
[ https://issues.apache.org/jira/browse/MESOS-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-8734: Assignee: Andrei Budnik > Restore `WaitAfterDestroy` test to check termination status of a terminated > nested container. > - > > Key: MESOS-8734 > URL: https://issues.apache.org/jira/browse/MESOS-8734 > Project: Mesos > Issue Type: Task >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Major > Labels: mesosphere, test > > It's important to check that after termination of a nested container, its > termination status is available. This property is used in the default executor. > Note that the test uses the Mesos containerizer and checks the above-mentioned property only > for the Mesos containerizer. > Right now, if we remove [this section of > code|https://github.com/apache/mesos/blob/5b655ce062ff55cdefed119d97ad923aeeb2efb5/src/slave/containerizer/mesos/containerizer.cpp#L2093-L2111], > no test will be broken! > https://reviews.apache.org/r/65505 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8687) Check failure in `ProcessBase::_consume()`.
[ https://issues.apache.org/jira/browse/MESOS-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454395#comment-16454395 ] Benno Evers commented on MESOS-8687: Review for the test fix: https://reviews.apache.org/r/66799/ > Check failure in `ProcessBase::_consume()`. > --- > > Key: MESOS-8687 > URL: https://issues.apache.org/jira/browse/MESOS-8687 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.6.0 > Environment: ec2 CentOS 7 with SSL >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Major > Labels: flaky-test, reliability > Attachments: MasterAPITest.MasterFailover-with-CHECK.txt, > MasterFailover-badrun.txt > > > Observed a segfault in the {{MasterAPITest.MasterFailover}} test: > {noformat} > 10:59:04 I0319 10:59:04.312197 3274 master.cpp:649] Authorization enabled > 10:59:04 F0319 10:59:04.312772 3274 owned.hpp:110] Check failed: 'get()' > Must be non NULL > 10:59:04 *** Check failure stack trace: *** > 10:59:04 I0319 10:59:04.313470 3279 hierarchical.cpp:175] Initialized > hierarchical allocator process > 10:59:04 I0319 10:59:04.313500 3279 whitelist_watcher.cpp:77] No whitelist > given > 10:59:04 @ 0x7fe82d44e0cd google::LogMessage::Fail() > 10:59:04 @ 0x7fe82d44ff1d google::LogMessage::SendToLog() > 10:59:04 @ 0x7fe82d44dcb3 google::LogMessage::Flush() > 10:59:04 @ 0x7fe82d450919 google::LogMessageFatal::~LogMessageFatal() > 10:59:04 @ 0x7fe82d3cee16 google::CheckNotNull<>() > 10:59:04 @ 0x7fe82d3b4253 process::ProcessBase::_consume() > 10:59:04 @ 0x7fe82d3b4a66 > _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZNS1_11ProcessBase7consumeEONS1_9HttpEventEEUlRKNS1_5OwnedINS3_7Request_JSG_clEv > 10:59:04 @ 0x7fe82c39c3ca > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_JST_SI_St12_PlaceholderILi1EEclEOS3_ > 10:59:04 @ 0x7fe82d39f2c1 process::ProcessBase::consume() > 10:59:04 @ 0x7fe82d3b84da process::ProcessManager::resume() > 10:59:04 @ 0x7fe82d3bbf56 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > 10:59:04 @ 0x7fe82d577870 execute_native_thread_routine > 10:59:04 @ 0x7fe82a761e25 start_thread > 10:59:04 @ 0x7fe82986334d __clone > {noformat} > Full test log is attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8797) Check failed in the default executor while running `MesosContainerizer/DefaultExecutorTest.TaskUsesExecutor/0` test.
[ https://issues.apache.org/jira/browse/MESOS-8797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454390#comment-16454390 ] Benno Evers commented on MESOS-8797: https://reviews.apache.org/r/66815/ > Check failed in the default executor while running > `MesosContainerizer/DefaultExecutorTest.TaskUsesExecutor/0` test. > > > Key: MESOS-8797 > URL: https://issues.apache.org/jira/browse/MESOS-8797 > Project: Mesos > Issue Type: Bug > Components: executor > Environment: Centos 7 SSL (internal CI) > master-[a95d9b8|https://github.com/apache/mesos/commit/a95d9b8fb53ab8fbf4a7b6d762c9e0749b4c013a] > (17-Apr-2018 14:03:14) >Reporter: Andrei Budnik >Priority: Major > Labels: flaky, flaky-test > Attachments: DefaultExecutorTest.TaskUsesExecutor-badrun.txt > > > {code:java} > lt-mesos-default-executor: ../../3rdparty/stout/include/stout/option.hpp:119: > T& Option::get() & [with T = std::basic_string]: Assertion > `isSome()' failed. > *** Aborted at 1523976443 (unix time) try "date -d @1523976443" if you are > using GNU date *** > PC: @ 0x7efcfd11f1f7 __GI_raise > *** SIGABRT (@0x4d44) received by PID 19780 (TID 0x7efcf5adb700) from PID > 19780; stack trace: *** > @ 0x7efcfd9da5e0 (unknown) > @ 0x7efcfd11f1f7 __GI_raise > @ 0x7efcfd1208e8 __GI_abort > @ 0x7efcfd118266 __assert_fail_base > @ 0x7efcfd118312 __GI___assert_fail > @ 0x55a05fa269f7 mesos::internal::DefaultExecutor::waited() > @ 0x7efd002212d1 process::ProcessBase::consume() > @ 0x7efd0023a52a process::ProcessManager::resume() > @ 0x7efd0023dfa6 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > @ 0x7efd003f9470 execute_native_thread_routine > @ 0x7efcfd9d2e25 start_thread > @ 0x7efcfd1e234d __clone > {code} > Observed this failure in internal CI for test > {code:java} > MesosContainerizer/DefaultExecutorTest.TaskUsesExecutor/0{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8809) Add functions for manipulating POSIX ACLs into stout
[ https://issues.apache.org/jira/browse/MESOS-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454262#comment-16454262 ] Qian Zhang commented on MESOS-8809: --- RR: https://reviews.apache.org/r/66811/ > Add functions for manipulating POSIX ACLs into stout > > > Key: MESOS-8809 > URL: https://issues.apache.org/jira/browse/MESOS-8809 > Project: Mesos > Issue Type: Task > Components: stout >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > We need to add functions for setting/getting POSIX ACLs into stout so that we > can leverage these functions to grant volume permissions to the specific task > user. > This will introduce a new dependency {{libacl-devel}} when building Mesos. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8834) In libprocess, internal::send and internal::_send call each other; when outgoing[socket] always has packets to send, the stack can be exhausted, causing a core dump
[ https://issues.apache.org/jira/browse/MESOS-8834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454219#comment-16454219 ] Benno Evers commented on MESOS-8834: While I can't really understand the text, judging from the send -> _send -> send -> ... -> coredump sequence, this looks like it might be the same issue as MESOS-8594? > In libprocess, internal::send and internal::_send call each other; > when outgoing[socket] always has packets to send, the stack can be exhausted, causing a core dump > > > Key: MESOS-8834 > URL: https://issues.apache.org/jira/browse/MESOS-8834 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.5.0 >Reporter: liwuqi >Priority: Blocker > Labels: core, libprocess, send > > If a process sends messages in a while(true) loop, a large number of > messages will be buffered in outgoing[socket]. Since the sending is carried out underneath by internal::send and internal::_send, a recursive call chain arises: > _send -> send -> _send -> send -> ... -> _send -> send -> ... > The call stack keeps growing until it is exhausted and a core dump occurs. > In my local tests, the core dump happened once the stack depth reached 40,000+. > To solve this problem, the underlying message sending mechanism needs to be changed. > > Please pay attention to this issue, thanks. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8841) Flaky `MasterAllocatorTest/0.SingleFramework`
Andrei Budnik created MESOS-8841: Summary: Flaky `MasterAllocatorTest/0.SingleFramework` Key: MESOS-8841 URL: https://issues.apache.org/jira/browse/MESOS-8841 Project: Mesos Issue Type: Bug Components: allocation, master Environment: Fedora 25 master/a1c6a7a3c5 Reporter: Andrei Budnik {code:java} [ RUN ] MasterAllocatorTest/0.SingleFramework F0426 08:31:29.775804 9701 hierarchical.cpp:586] Check failed: slaves.contains(slaveId) *** Check failure stack trace: *** @ 0x7f365e108fb8 google::LogMessage::Fail() @ 0x7f365e108f15 google::LogMessage::SendToLog() @ 0x7f365e10890f google::LogMessage::Flush() @ 0x7f365e10b6d2 google::LogMessageFatal::~LogMessageFatal() @ 0x7f365c63b8d7 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave() @ 0x55728a500ac7 _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDES8_EEvRKNS_3PIDIT_EEMSA_FvT0_EOT1_ENKUlOS6_PNS_11ProcessBaseEE_clESJ_SL_ @ 0x55728a589908 _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS3_7SlaveIDESA_EEvRKNS1_3PIDIT_EEMSC_FvT0_EOT1_EUlOS8_PNS1_11ProcessBaseEE_JS8_SN_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSP_ @ 0x55728a586a0f _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_7SlaveIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi113invoke_expandISP_St5tupleIJS9_SR_EESU_IJOSO_EEJLm0ELm1DTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISH_Efp0_EEcl7forwardISK_Efp2_OSD_OSH_N5cpp1416integer_sequenceImJXspT2_SL_ @ 0x55728a5852b0 _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_7SlaveIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi1clIJSO_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOSX_ @ 0x55728a584209 _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS6_7SlaveIDESD_EEvRKNS4_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS4_11ProcessBaseEE_JSB_St12_PlaceholderILi1EJSQ_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOSV_ @ 0x55728a583995 _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS7_7SlaveIDESE_EEvRKNS5_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS5_11ProcessBaseEE_JSC_St12_PlaceholderILi1EJSR_EEEvOSG_DpOT0_ @ 0x55728a581522 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSA_7SlaveIDESH_EEvRKNS1_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_S3_E_JSF_St12_PlaceholderILi1EEclEOS3_ @ 0x7f365e0484c0 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_ @ 0x7f365e025760 process::ProcessBase::consume() @ 0x7f365e033abc _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE @ 0x55728a1cb6ea process::ProcessBase::serve() @ 0x7f365e0225ed process::ProcessManager::resume() @ 0x7f365e01e94c _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv @ 0x7f365e031080 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE @ 0x7f365e030a34 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEclEv @ 0x7f365e030338 
_ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv @ 0x7f365478976f (unknown) @ 0x7f3654e6973a start_thread @ 0x7f3653eefe7f __GI___clone{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7944) Implement jemalloc memory profiling support for Mesos
[ https://issues.apache.org/jira/browse/MESOS-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453831#comment-16453831 ] Alexander Rukletsov commented on MESOS-7944: {noformat} commit aa65947286d9115d1bdd34d7b7f0f0038e128345 Author: Benno Evers bev...@mesosphere.com AuthorDate: Thu Apr 26 12:01:26 2018 +0200 Commit: Alexander Rukletsov al...@apache.org CommitDate: Thu Apr 26 12:45:02 2018 +0200 Added documentation for memory profiling. Review: https://reviews.apache.org/r/63372/ {noformat} > Implement jemalloc memory profiling support for Mesos > - > > Key: MESOS-7944 > URL: https://issues.apache.org/jira/browse/MESOS-7944 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Benno Evers >Priority: Major > Labels: mesosphere > Fix For: 1.6.0 > > > After investigation in MESOS-7876 and discussion on the mailing list, this > task is for tracking progress on adding out-of-the-box memory profiling > support using jemalloc to Mesos. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7854) Authorize resource calls to provider manager api
[ https://issues.apache.org/jira/browse/MESOS-7854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453760#comment-16453760 ] Jan Schlicht commented on MESOS-7854: - Closing this in favor of MESOS-8774, as that ticket is more specific. > Authorize resource calls to provider manager api > > > Key: MESOS-7854 > URL: https://issues.apache.org/jira/browse/MESOS-7854 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Bannier >Priority: Critical > Labels: csi-post-mvp, mesosphere, storage > > The resource provider manager provides a function > {code} > process::Future<process::http::Response> api( > const process::http::Request& request, > const Option<Principal>& principal) const; > {code} > which is exposed e.g., as an agent endpoint. > We need to add authorization to this function in order to e.g., stop rogue > callers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8774) Authenticate and authorize calls to the resource provider manager's API
[ https://issues.apache.org/jira/browse/MESOS-8774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-8774: --- Assignee: Jan Schlicht > Authenticate and authorize calls to the resource provider manager's API > > > Key: MESOS-8774 > URL: https://issues.apache.org/jira/browse/MESOS-8774 > Project: Mesos > Issue Type: Task > Components: agent >Reporter: Benjamin Bannier >Assignee: Jan Schlicht >Priority: Major > Labels: mesosphere > > The resource provider manager is exposed via an agent endpoint against which > resource providers subscribe or perform other actions. We should authenticate > and authorize any interactions there. > Since local resource providers currently run on agents, which manage their > lifetime, it seems natural to extend the framework used for executor > authentication to resource providers as well. The agent would then generate a > secret token whenever a new resource provider is started and inject it into > the resource providers it launches. Resource providers in turn would use this > token when interacting with the manager API. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
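A minimal sketch of the token flow described in that ticket, with everything (class names, token format, transport) assumed for illustration rather than taken from the actual design: the agent mints a random token per resource provider, hands it to the provider at launch, and the manager only accepts calls that present a known token.
{code}
// Hypothetical sketch of the described token flow; names, token format and
// transport are assumptions for illustration, not the actual Mesos design.
#include <iostream>
#include <random>
#include <string>
#include <unordered_map>

// Agent side: mint and remember a random token for a new resource provider.
class TokenIssuer {
public:
  std::string issue(const std::string& providerId) {
    static const char hex[] = "0123456789abcdef";
    std::random_device rd;
    std::string token;
    for (int i = 0; i < 32; ++i) {
      token += hex[rd() % 16];
    }
    tokens[token] = providerId;
    return token;  // Injected into the provider at launch, e.g. via its env.
  }

  // Manager side: authenticate a call by looking up the presented token.
  bool authenticate(const std::string& token, std::string* providerId) const {
    auto it = tokens.find(token);
    if (it == tokens.end()) {
      return false;  // Unknown token: reject before any authorization check.
    }
    *providerId = it->second;
    return true;
  }

private:
  std::unordered_map<std::string, std::string> tokens;
};

int main() {
  TokenIssuer issuer;
  const std::string token = issuer.issue("local-resource-provider-1");

  std::string who;
  std::cout << (issuer.authenticate(token, &who) ? "accepted: " + who
                                                 : std::string("rejected"))
            << std::endl;
  std::cout << (issuer.authenticate("bogus", &who) ? "accepted"
                                                   : "rejected") << std::endl;
  return 0;
}
{code}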