[jira] [Created] (MESOS-8973) Add a new document for cgroups isolator
Qian Zhang created MESOS-8973: - Summary: Add a new document for cgroups isolator Key: MESOS-8973 URL: https://issues.apache.org/jira/browse/MESOS-8973 Project: Mesos Issue Type: Documentation Components: cgroups, documentation Reporter: Qian Zhang Currently we have separate docs for the cgroups subsystems under [https://github.com/apache/mesos/tree/master/docs/isolators]. We should merge all of them into a single doc (say `cgroups.md`) in which each subsystem has its own section, and also describe the motivation and semantics of `cgroups/all`, introduced in MESOS-7691. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-8943) Add metrics about CSI calls.
[ https://issues.apache.org/jira/browse/MESOS-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497460#comment-16497460 ] Chun-Hung Hsiao edited comment on MESOS-8943 at 6/1/18 1:40 AM: {noformat} commit cae7d7a385e9edf0db81d524c6a208f6ad8540fd Author: Chun-Hung Hsiao Date: Tue May 29 19:29:16 2018 -0700 Added the `RPC` enum and `RPCTraits` helper. To make it easy to enumerate all types of CSI RPC calls, the `RPC` enum is introduced. The `RPCTraits` helper class can be used to determine the request and response type of a particular RPC. Review: https://reviews.apache.org/r/67375{noformat} {noformat} commit 15fc86d22f1fbe922ef878bb2e6f6462d2248b14 Author: Chun-Hung Hsiao Date: Tue May 22 17:34:06 2018 -0700 Added per-CSI-call RPC metrics for SLRP. For each CSI call, e.g., `csi.v0.Identity.Probe`, we add the following metrics for SLRP: `csi_plugin/rpcs/csi.v0.Identity.Probe/pending` `csi_plugin/rpcs/csi.v0.Identity.Probe/successes` `csi_plugin/rpcs/csi.v0.Identity.Probe/errors` `csi_plugin/rpcs/csi.v0.Identity.Probe/cancelled` To add these per-CSI-call metrics, each method in `csi::v0::Client`, e.g., `csi::v0::Client::Probe`, is changed to `csi::v0::Client::call`, to make RPC calls based on the RPC enum value. A `call` helper function in SLRP is also added to intercept CSI calls and update the corresponding metrics. Review: https://reviews.apache.org/r/67255{noformat} {noformat} commit 1a1f0bab2fde34095c643cefdb24d700441048d0 Author: Chun-Hung Hsiao Date: Tue May 22 14:47:03 2018 -0700 Added a unit test for CSI plugin RPC metrics. This patch adds the `ROOT_CsiPluginRpcMetrics` test that issues a `CREATE_VOLUME` followed by a `DESTROY_VOLUME`, which would fail due to an out-of-band deletion of the actual volume. 
Review: https://reviews.apache.org/r/67256{noformat} {noformat} commit db075fc67aceb8f75bbc204aae042a30b65c57e3 Author: Chun-Hung Hsiao Date: Thu May 24 18:01:26 2018 -0700 Added documentation for resource provider and CSI plugin metrics. Review: https://reviews.apache.org/r/67303{noformat} > Add metrics about CSI calls. 
> > > Key: MESOS-8943 > URL: https://issues.apache.org/jira/browse/MESOS-8943 > Project: Mesos > Issue Type: Task > Components: storage >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Major > Labels: mesosphere, storage > Fix For: 1.7.0 > > > We should add metrics for CSI calls so operators can be alerted on flapping > CSI plugins. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
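The commit messages quoted in the MESOS-8943 comment describe an `RPC` enum, an `RPCTraits` helper that maps each enumerator to its request and response types, and metric keys of the form `csi_plugin/rpcs/<call>/<suffix>`. The following is a minimal sketch of that idea, not the Mesos implementation: the message structs, the `name` and `metricKey` helpers, and the choice of enumerators are all assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical message types. In a real CSI client these would be
// generated protobuf classes; empty structs keep the sketch standalone.
struct ProbeRequest {};
struct ProbeResponse {};
struct GetPluginInfoRequest {};
struct GetPluginInfoResponse {};

// One enumerator per RPC (only two are shown here).
enum class RPC { PROBE, GET_PLUGIN_INFO };

// The primary template is left undefined so that an RPC without a
// mapping fails at compile time instead of at run time.
template <RPC>
struct RPCTraits;

template <>
struct RPCTraits<RPC::PROBE> {
  using request_type = ProbeRequest;
  using response_type = ProbeResponse;
};

template <>
struct RPCTraits<RPC::GET_PLUGIN_INFO> {
  using request_type = GetPluginInfoRequest;
  using response_type = GetPluginInfoResponse;
};

// Hypothetical helper: the fully qualified call name used in metric keys.
inline std::string name(RPC rpc) {
  switch (rpc) {
    case RPC::PROBE: return "csi.v0.Identity.Probe";
    case RPC::GET_PLUGIN_INFO: return "csi.v0.Identity.GetPluginInfo";
  }
  return "unknown";
}

// Hypothetical helper: builds a key such as
// "csi_plugin/rpcs/csi.v0.Identity.Probe/pending".
inline std::string metricKey(RPC rpc, const std::string& suffix) {
  assert(!suffix.empty());
  return "csi_plugin/rpcs/" + name(rpc) + "/" + suffix;
}
```

With this shape, a single generic `call` wrapper can look up the key for whichever enumerator it is handed and update the `pending`, `successes`, `errors`, or `cancelled` counter around the actual RPC, which is presumably why the commits route every client method through one enum-driven entry point.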
[jira] [Created] (MESOS-8972) when choose docker image use user network all mesos agent crash
saturnman created MESOS-8972: Summary: when choose docker image use user network all mesos agent crash Key: MESOS-8972 URL: https://issues.apache.org/jira/browse/MESOS-8972 Project: Mesos Issue Type: Bug Components: docker Affects Versions: 1.7.0 Environment: Ubuntu 14.04 and Ubuntu 16.04; Mesos crashes on both Reporter: saturnman When a Docker task that uses a user network is submitted from Marathon, the mesos-agent process crashes with the following backtrace: mesos-agent: .././../3rdparty/stout/include/stout/option.hpp:118: const T& Option::get() const & [with T = std::__cxx11::basic_string]: Assertion `isSome()' failed. *** Aborted at 1527797505 (unix time) try "date -d @1527797505" if you are using GNU date *** PC: @ 0x7fc03d43f428 (unknown) *** SIGABRT (@0x4514) received by PID 17684 (TID 0x7fc033143700) from PID 17684; stack trace: *** @ 0x7fc03dd7d390 (unknown) @ 0x7fc03d43f428 (unknown) @ 0x7fc03d44102a (unknown) @ 0x7fc03d437bd7 (unknown) @ 0x7fc03d437c82 (unknown) @ 0x564f1ad8871d _ZNKR6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc3getEv @ 0x7fc048c43256 mesos::internal::slave::NetworkCniIsolatorProcess::getNetworkConfigJSON() @ 0x7fc048c368cb mesos::internal::slave::NetworkCniIsolatorProcess::prepare() @ 0x7fc0486e5c18 _ZZN7process8dispatchI6OptionIN5mesos5slave19ContainerLaunchInfoEENS2_8internal5slave20MesosIsolatorProcessERKNS2_11ContainerIDERKNS3_15ContainerConfigESB_SE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSJ_FSH_T1_T2_EOT3_OT4_ENKUlSt10unique_ptrINS_7PromiseIS5_EESt14default_deleteISX_EEOS9_OSC_PNS_11ProcessBaseEE_clES10_S11_S12_S14_ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
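The backtrace in MESOS-8972 shows an `isSome()` assertion firing inside `Option::get()`, meaning `get()` was called on an option holding no value. The report does not say which value was missing, so the sketch below only illustrates the general pattern of checking before unwrapping. The `Option` class here is a simplified stand-in written for this example, not the stout implementation, and `describeNetwork` is a hypothetical helper.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for an optional value. This is NOT stout's
// `Option`; it only mirrors the `isSome()` / `isNone()` / `get()` shape.
template <typename T>
class Option {
public:
  Option() : some(false), value() {}
  Option(const T& t) : some(true), value(t) {}

  bool isSome() const { return some; }
  bool isNone() const { return !some; }

  // Aborts when no value is held, like the assertion in the backtrace.
  const T& get() const {
    assert(isSome());
    return value;
  }

private:
  bool some;
  T value;
};

// Hypothetical helper: report a missing network name as an error the
// caller can handle, instead of letting `get()` abort the process.
std::string describeNetwork(const Option<std::string>& name) {
  if (name.isNone()) {
    return "error: no network name provided";
  }
  return "joining network '" + name.get() + "'";
}
```

Turning the missing value into an ordinary error would let a single bad task launch fail on its own rather than take the whole agent down, though whether that is the right fix for this particular crash depends on details the report does not include.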
[jira] [Assigned] (MESOS-8970) Tests relying on metrics segfault on some Linux distros.
[ https://issues.apache.org/jira/browse/MESOS-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-8970: -- Resolution: Fixed Assignee: Benno Evers (was: Benjamin Mahler) Fix Version/s: 1.7.0 {noformat} commit fa41ca7dcb60c6fcac8afc6ec35f36a69b90b65b Author: Benno Evers Date: Thu May 31 10:07:31 2018 -0700 Fixed a crash in libprocess due to order-of-evaluation bug. Up to C++17, the only ordering constraint on the evaluation of expressions between synchronization points was that function arguments shall be evaluated before calling a function. This could lead to the situation where `std::move(futures)` could be called before `await(futures.values())`, leading to a function call on a moved-from object and thus undefined behaviour. Review: https://reviews.apache.org/r/67401/ {noformat} > Tests relying on metrics segfault on some Linux distros. > > > Key: MESOS-8970 > URL: https://issues.apache.org/jira/browse/MESOS-8970 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.7.0 >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Blocker > Labels: libprocess > Fix For: 1.7.0 > > > [Recent changes to > metrics|https://github.com/apache/mesos/compare/6ae44980c47ed99216edc81c8d4b3ad1255cd711...0f6ce843b506262acdccba50e8686ca5798aa633] > in libprocess likely trigger some UB. 
For example, > {noformat} > 07:12:34 [ RUN ] FetcherTest.CustomOutputFileSubdirectory > 07:12:34 I0531 07:12:34.379432 16126 fetcher.cpp:369] Starting to fetch URIs > for container: 43a2297e-54ea-46d5-89bc-df3813dde6de, directory: /tmp/018jUp > 07:12:34 I0531 07:12:34.380430 16126 fetcher.cpp:875] Fetching URIs using > command > '/home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/build/src/mesos-fetcher' > 07:12:34 I0531 07:12:34.580570 16124 process.cpp:3583] Handling HTTP event > for process 'metrics' with path: '/metrics/snapshot' > 07:12:34 F0531 07:12:34.582866 16127 metrics.cpp:219] CHECK_SOME(timeout): is > NONE > 07:12:34 *** Check failure stack trace: *** > 07:12:34 @ 0x7f81f70f763d google::LogMessage::Fail() > 07:12:34 @ 0x7f81f70f93bd google::LogMessage::SendToLog() > 07:12:34 @ 0x7f81f70f7223 google::LogMessage::Flush() > 07:12:34 @ 0x7f81f70f9e5e google::LogMessageFatal::~LogMessageFatal() > 07:12:34 @ 0x11d0322 _CheckFatal::~_CheckFatal() > 07:12:34 @ 0x7f81f8a7e153 > process::metrics::internal::MetricsProcess::__snapshot() > 07:12:34 @ 0x7f81f8a8be88 > _ZZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS_6FutureIdEESt4hashISsESt8equal_toISsEEOSH_ISsSC_INS_10StatisticsIdEEESL_SN_ESG_SO_ST_EENSI_IT_EERKNS_3PIDIT0_EEMSY_FSW_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS8_EESt14default_deleteIS1F_EEOSE_SP_SU_PNS_11ProcessBaseEE_clES1I_S1J_SP_SU_S1L_ > 07:12:34 @ 0x7f81f8ac5bea > 
_ZN5cpp176invokeIZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS1_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS1_6FutureIdEESt4hashISsESt8equal_toISsEEOSJ_ISsSE_INS1_10StatisticsIdEEESN_SP_ESI_SQ_SV_EENSK_IT_EERKNS1_3PIDIT0_EEMS10_FSY_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteIS1H_EEOSG_SR_SW_PNS1_11ProcessBaseEE_JS1K_SG_SQ_SV_S1N_EEEDTclcl7forwardISX_Efp_Espcl7forwardIT0_Efp0_EEEOSX_DpOS1P_ > 07:12:34 @ 0x7f81f8ac2a34 > _ZN6lambda8internal7PartialIZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS2_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS2_6FutureIdEESt4hashISsESt8equal_toISsEEOSK_ISsSF_INS2_10StatisticsIdEEESO_SQ_ESJ_SR_SW_EENSL_IT_EERKNS2_3PIDIT0_EEMS11_FSZ_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS2_7PromiseISB_EESt14default_deleteIS1I_EEOSH_SS_SX_PNS2_11ProcessBaseEE_IS1L_SH_SR_SW_St12_PlaceholderILi113invoke_expandIS1P_St5tupleIIS1L_SH_SR_SW_S1R_EES1U_IIOS1O_EEILm0ELm1ELm2ELm3ELm4DTcl6invokecl7forwardISY_Efp_Espcl6expandcl3getIXT2_EEcl7forwardIS11_Efp0_EEcl7forwardIS15_Efp2_OSY_OS11_N5cpp1416integer_sequenceImIXspT2_OS15_ > 07:12:34 @ 0x7f81f8abee6e >
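The commit message quoted for MESOS-8970 attributes the crash to an order-of-evaluation hazard: in a chained call, an argument that moves from an object may be evaluated before an earlier part of the chain reads that object. The sketch below reproduces the shape of the problem with a hypothetical `Chain` type and `start` function (neither is from libprocess) and shows one way to make the order explicit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a chainable, future-like result.
struct Chain {
  explicit Chain(std::size_t n) : observed(n), owned(0) {}

  // Takes ownership of the names, mimicking a continuation that binds them.
  Chain& then(std::vector<std::string> taken) {
    owned = taken.size();
    return *this;
  }

  std::size_t observed; // How many names `start` saw.
  std::size_t owned;    // How many names `then` received.
};

// Reads `names` without modifying it, in the role of `await(futures.values())`.
Chain start(const std::vector<std::string>& names) {
  return Chain(names.size());
}

// Hazardous before C++17: the parameter of `then` may be initialized
// (moving from `names`) before `start(names)` has read it:
//
//   return start(names).then(std::move(names));

// Safe under any standard: each step is its own full expression.
Chain snapshot(std::vector<std::string> names) {
  Chain chain = start(names);    // Read first.
  chain.then(std::move(names));  // Move only afterwards.
  assert(chain.observed == chain.owned);
  return chain;
}
```

Splitting the chain into separate statements trades a little brevity for an evaluation order that no longer depends on the compiler or the language standard in use.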
[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496818#comment-16496818 ] Benno Evers commented on MESOS-7966: https://reviews.apache.org/r/67403/ > check for maintenance on agent causes fatal error > - > > Key: MESOS-7966 > URL: https://issues.apache.org/jira/browse/MESOS-7966 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.1.3, 1.2.3, 1.3.2, 1.4.1, 1.5.0, 1.6.0 >Reporter: Rob Johnson >Assignee: Benno Evers >Priority: Critical > Labels: mesosphere, reliability > > We interact with the maintenance API frequently to orchestrate gracefully > draining agents of tasks without impacting service availability. > Occasionally we seem to trigger a fatal error in Mesos when interacting with > the API. This happens relatively frequently, and impacts us when downstream > frameworks (marathon) react badly to leader elections. > Here is the log line that we see when the master dies: > {code} > F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: > slaves[slaveId].maintenance.isSome() > {code} > It's quite possible that we're using the maintenance API in the wrong way. We're > happy to provide any other logs you need - please let me know what would be > useful for debugging. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8970) Tests relying on metrics segfault on some Linux distros.
[ https://issues.apache.org/jira/browse/MESOS-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496743#comment-16496743 ] Benno Evers commented on MESOS-8970: https://reviews.apache.org/r/67401 > Tests relying on metrics segfault on some Linux distros. > > > Key: MESOS-8970 > URL: https://issues.apache.org/jira/browse/MESOS-8970 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.7.0 >Reporter: Alexander Rukletsov >Assignee: Benjamin Mahler >Priority: Blocker > Labels: libprocess > > [Recent changes to > metrics|https://github.com/apache/mesos/compare/6ae44980c47ed99216edc81c8d4b3ad1255cd711...0f6ce843b506262acdccba50e8686ca5798aa633] > in libprocess likely trigger some UB. For example, > {noformat} > 07:12:34 [ RUN ] FetcherTest.CustomOutputFileSubdirectory > 07:12:34 I0531 07:12:34.379432 16126 fetcher.cpp:369] Starting to fetch URIs > for container: 43a2297e-54ea-46d5-89bc-df3813dde6de, directory: /tmp/018jUp > 07:12:34 I0531 07:12:34.380430 16126 fetcher.cpp:875] Fetching URIs using > command > '/home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/build/src/mesos-fetcher' > 07:12:34 I0531 07:12:34.580570 16124 process.cpp:3583] Handling HTTP event > for process 'metrics' with path: '/metrics/snapshot' > 07:12:34 F0531 07:12:34.582866 16127 metrics.cpp:219] CHECK_SOME(timeout): is > NONE > 07:12:34 *** Check failure stack trace: *** > 07:12:34 @ 0x7f81f70f763d google::LogMessage::Fail() > 07:12:34 @ 0x7f81f70f93bd google::LogMessage::SendToLog() > 07:12:34 @ 0x7f81f70f7223 google::LogMessage::Flush() > 07:12:34 @ 0x7f81f70f9e5e google::LogMessageFatal::~LogMessageFatal() > 07:12:34 @ 0x11d0322 _CheckFatal::~_CheckFatal() > 07:12:34 @ 0x7f81f8a7e153 > process::metrics::internal::MetricsProcess::__snapshot() > 07:12:34 @ 0x7f81f8a8be88 > 
_ZZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS_6FutureIdEESt4hashISsESt8equal_toISsEEOSH_ISsSC_INS_10StatisticsIdEEESL_SN_ESG_SO_ST_EENSI_IT_EERKNS_3PIDIT0_EEMSY_FSW_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS8_EESt14default_deleteIS1F_EEOSE_SP_SU_PNS_11ProcessBaseEE_clES1I_S1J_SP_SU_S1L_ > 07:12:34 @ 0x7f81f8ac5bea > _ZN5cpp176invokeIZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS1_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS1_6FutureIdEESt4hashISsESt8equal_toISsEEOSJ_ISsSE_INS1_10StatisticsIdEEESN_SP_ESI_SQ_SV_EENSK_IT_EERKNS1_3PIDIT0_EEMS10_FSY_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteIS1H_EEOSG_SR_SW_PNS1_11ProcessBaseEE_JS1K_SG_SQ_SV_S1N_EEEDTclcl7forwardISX_Efp_Espcl7forwardIT0_Efp0_EEEOSX_DpOS1P_ > 07:12:34 @ 0x7f81f8ac2a34 > _ZN6lambda8internal7PartialIZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS2_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS2_6FutureIdEESt4hashISsESt8equal_toISsEEOSK_ISsSF_INS2_10StatisticsIdEEESO_SQ_ESJ_SR_SW_EENSL_IT_EERKNS2_3PIDIT0_EEMS11_FSZ_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS2_7PromiseISB_EESt14default_deleteIS1I_EEOSH_SS_SX_PNS2_11ProcessBaseEE_IS1L_SH_SR_SW_St12_PlaceholderILi113invoke_expandIS1P_St5tupleIIS1L_SH_SR_SW_S1R_EES1U_IIOS1O_EEILm0ELm1ELm2ELm3ELm4DTcl6invokecl7forwardISY_Efp_Espcl6expandcl3getIXT2_EEcl7forwardIS11_Efp0_EEcl7forwardIS15_Efp2_OSY_OS11_N5cpp1416integer_sequenceImIXspT2_OS15_ > 07:12:34 @ 0x7f81f8abee6e > 
_ZNO6lambda8internal7PartialIZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS2_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS2_6FutureIdEESt4hashISsESt8equal_toISsEEOSK_ISsSF_INS2_10StatisticsIdEEESO_SQ_ESJ_SR_SW_EENSL_IT_EERKNS2_3PIDIT0_EEMS11_FSZ_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS2_7PromiseISB_EESt14default_deleteIS1I_EEOSH_SS_SX_PNS2_11ProcessBaseEE_JS1L_SH_SR_SW_St12_PlaceholderILi1clIJS1O_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2ELm3ELm4_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOS1X_ > 07:12:34 @ 0x7f81f8abca67 > _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS4_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS4_6FutureIdEESt4hashISsESt8equal_toISsEEOSM_ISsSH_INS4_10StatisticsIdEEESQ_SS_ESL_ST_SY_EENSN_IT_EERKNS4_3PIDIT0_EEMS13_FS11_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS4_7PromiseISD_EESt14default_deleteIS1K_EEOSJ_SU_SZ_PNS4_11ProcessBaseEE_JS1N_SJ_ST_SY_St12_PlaceholderILi1EJS1Q_EEEDTclcl7forwardIS10_Efp_Espcl7forwardIT0_Efp0_EEEOS10_DpOS1V_ > 07:12:34 @ 0x7f81f8abb625 >
[jira] [Assigned] (MESOS-8971) External Resource Provider Design
[ https://issues.apache.org/jira/browse/MESOS-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao reassigned MESOS-8971: -- Shepherd: (was: Jie Yu) Assignee: (was: Chun-Hung Hsiao) > External Resource Provider Design > - > > Key: MESOS-8971 > URL: https://issues.apache.org/jira/browse/MESOS-8971 > Project: Mesos > Issue Type: Task >Reporter: Chun-Hung Hsiao >Priority: Major > Labels: mesosphere, storage > > We need a design for the external resource provider and for how external resources > are used. How external resources are offered to frameworks is a separate > issue and is not covered in this design. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8528) Design doc for External Resource Provider (ERP) support.
[ https://issues.apache.org/jira/browse/MESOS-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao reassigned MESOS-8528: -- Assignee: Chun-Hung Hsiao > Design doc for External Resource Provider (ERP) support. > > > Key: MESOS-8528 > URL: https://issues.apache.org/jira/browse/MESOS-8528 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Chun-Hung Hsiao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8971) External Resource Provider Design
Chun-Hung Hsiao created MESOS-8971: -- Summary: External Resource Provider Design Key: MESOS-8971 URL: https://issues.apache.org/jira/browse/MESOS-8971 Project: Mesos Issue Type: Task Reporter: Chun-Hung Hsiao Assignee: Chun-Hung Hsiao We need a design for the external resource provider and for how external resources are used. How external resources are offered to frameworks is a separate issue and is not covered in this design. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8970) Tests relying on metrics segfault on some Linux distros.
[ https://issues.apache.org/jira/browse/MESOS-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496673#comment-16496673 ] Alexander Rukletsov commented on MESOS-8970: The following change to {{3rdparty/libprocess/src/metrics/metrics.cpp}} seems to fix the issue: {noformat} Future<Nothing> timedout = after(timeout.getOrElse(Duration::max())); + std::set<Future<Nothing>> fset{ + timedout, + await(futures.values()).then([]{ return Nothing(); }) }; + // Return the response once it finishes or we time out. - return select({ - timedout, - await(futures.values()).then([]{ return Nothing(); }) }) + return select(fset) .onAny([=]() mutable { timedout.discard(); }) // Don't accumulate timers. .then(defer(self(), ::__snapshot, {noformat} > Tests relying on metrics segfault on some Linux distros. > > > Key: MESOS-8970 > URL: https://issues.apache.org/jira/browse/MESOS-8970 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.7.0 >Reporter: Alexander Rukletsov >Assignee: Benjamin Mahler >Priority: Blocker > Labels: libprocess > > [Recent changes to > metrics|https://github.com/apache/mesos/compare/6ae44980c47ed99216edc81c8d4b3ad1255cd711...0f6ce843b506262acdccba50e8686ca5798aa633] > in libprocess likely trigger some UB. 
For example, > {noformat} > 07:12:34 [ RUN ] FetcherTest.CustomOutputFileSubdirectory > 07:12:34 I0531 07:12:34.379432 16126 fetcher.cpp:369] Starting to fetch URIs > for container: 43a2297e-54ea-46d5-89bc-df3813dde6de, directory: /tmp/018jUp > 07:12:34 I0531 07:12:34.380430 16126 fetcher.cpp:875] Fetching URIs using > command > '/home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/build/src/mesos-fetcher' > 07:12:34 I0531 07:12:34.580570 16124 process.cpp:3583] Handling HTTP event > for process 'metrics' with path: '/metrics/snapshot' > 07:12:34 F0531 07:12:34.582866 16127 metrics.cpp:219] CHECK_SOME(timeout): is > NONE > 07:12:34 *** Check failure stack trace: *** > 07:12:34 @ 0x7f81f70f763d google::LogMessage::Fail() > 07:12:34 @ 0x7f81f70f93bd google::LogMessage::SendToLog() > 07:12:34 @ 0x7f81f70f7223 google::LogMessage::Flush() > 07:12:34 @ 0x7f81f70f9e5e google::LogMessageFatal::~LogMessageFatal() > 07:12:34 @ 0x11d0322 _CheckFatal::~_CheckFatal() > 07:12:34 @ 0x7f81f8a7e153 > process::metrics::internal::MetricsProcess::__snapshot() > 07:12:34 @ 0x7f81f8a8be88 > _ZZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS_6FutureIdEESt4hashISsESt8equal_toISsEEOSH_ISsSC_INS_10StatisticsIdEEESL_SN_ESG_SO_ST_EENSI_IT_EERKNS_3PIDIT0_EEMSY_FSW_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS8_EESt14default_deleteIS1F_EEOSE_SP_SU_PNS_11ProcessBaseEE_clES1I_S1J_SP_SU_S1L_ > 07:12:34 @ 0x7f81f8ac5bea > 
_ZN5cpp176invokeIZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS1_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS1_6FutureIdEESt4hashISsESt8equal_toISsEEOSJ_ISsSE_INS1_10StatisticsIdEEESN_SP_ESI_SQ_SV_EENSK_IT_EERKNS1_3PIDIT0_EEMS10_FSY_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteIS1H_EEOSG_SR_SW_PNS1_11ProcessBaseEE_JS1K_SG_SQ_SV_S1N_EEEDTclcl7forwardISX_Efp_Espcl7forwardIT0_Efp0_EEEOSX_DpOS1P_ > 07:12:34 @ 0x7f81f8ac2a34 > _ZN6lambda8internal7PartialIZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS2_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS2_6FutureIdEESt4hashISsESt8equal_toISsEEOSK_ISsSF_INS2_10StatisticsIdEEESO_SQ_ESJ_SR_SW_EENSL_IT_EERKNS2_3PIDIT0_EEMS11_FSZ_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS2_7PromiseISB_EESt14default_deleteIS1I_EEOSH_SS_SX_PNS2_11ProcessBaseEE_IS1L_SH_SR_SW_St12_PlaceholderILi113invoke_expandIS1P_St5tupleIIS1L_SH_SR_SW_S1R_EES1U_IIOS1O_EEILm0ELm1ELm2ELm3ELm4DTcl6invokecl7forwardISY_Efp_Espcl6expandcl3getIXT2_EEcl7forwardIS11_Efp0_EEcl7forwardIS15_Efp2_OSY_OS11_N5cpp1416integer_sequenceImIXspT2_OS15_ > 07:12:34 @ 0x7f81f8abee6e > _ZNO6lambda8internal7PartialIZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS2_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS2_6FutureIdEESt4hashISsESt8equal_toISsEEOSK_ISsSF_INS2_10StatisticsIdEEESO_SQ_ESJ_SR_SW_EENSL_IT_EERKNS2_3PIDIT0_EEMS11_FSZ_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS2_7PromiseISB_EESt14default_deleteIS1I_EEOSH_SS_SX_PNS2_11ProcessBaseEE_JS1L_SH_SR_SW_St12_PlaceholderILi1clIJS1O_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2ELm3ELm4_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOS1X_ > 07:12:34 @
[jira] [Created] (MESOS-8970) Tests relying on metrics segfault on some Linux distros.
Alexander Rukletsov created MESOS-8970: -- Summary: Tests relying on metrics segfault on some Linux distros. Key: MESOS-8970 URL: https://issues.apache.org/jira/browse/MESOS-8970 Project: Mesos Issue Type: Bug Affects Versions: 1.7.0 Reporter: Alexander Rukletsov Assignee: Benjamin Mahler [Recent changes to metrics|https://github.com/apache/mesos/compare/6ae44980c47ed99216edc81c8d4b3ad1255cd711...0f6ce843b506262acdccba50e8686ca5798aa633] in libprocess likely trigger some UB. For example, {noformat} 07:12:34 [ RUN ] FetcherTest.CustomOutputFileSubdirectory 07:12:34 I0531 07:12:34.379432 16126 fetcher.cpp:369] Starting to fetch URIs for container: 43a2297e-54ea-46d5-89bc-df3813dde6de, directory: /tmp/018jUp 07:12:34 I0531 07:12:34.380430 16126 fetcher.cpp:875] Fetching URIs using command '/home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/build/src/mesos-fetcher' 07:12:34 I0531 07:12:34.580570 16124 process.cpp:3583] Handling HTTP event for process 'metrics' with path: '/metrics/snapshot' 07:12:34 F0531 07:12:34.582866 16127 metrics.cpp:219] CHECK_SOME(timeout): is NONE 07:12:34 *** Check failure stack trace: *** 07:12:34 @ 0x7f81f70f763d google::LogMessage::Fail() 07:12:34 @ 0x7f81f70f93bd google::LogMessage::SendToLog() 07:12:34 @ 0x7f81f70f7223 google::LogMessage::Flush() 07:12:34 @ 0x7f81f70f9e5e google::LogMessageFatal::~LogMessageFatal() 07:12:34 @ 0x11d0322 _CheckFatal::~_CheckFatal() 07:12:34 @ 0x7f81f8a7e153 process::metrics::internal::MetricsProcess::__snapshot() 07:12:34 @ 0x7f81f8a8be88 _ZZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS_6FutureIdEESt4hashISsESt8equal_toISsEEOSH_ISsSC_INS_10StatisticsIdEEESL_SN_ESG_SO_ST_EENSI_IT_EERKNS_3PIDIT0_EEMSY_FSW_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS8_EESt14default_deleteIS1F_EEOSE_SP_SU_PNS_11ProcessBaseEE_clES1I_S1J_SP_SU_S1L_ 07:12:34 @ 0x7f81f8ac5bea 
_ZN5cpp176invokeIZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS1_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS1_6FutureIdEESt4hashISsESt8equal_toISsEEOSJ_ISsSE_INS1_10StatisticsIdEEESN_SP_ESI_SQ_SV_EENSK_IT_EERKNS1_3PIDIT0_EEMS10_FSY_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteIS1H_EEOSG_SR_SW_PNS1_11ProcessBaseEE_JS1K_SG_SQ_SV_S1N_EEEDTclcl7forwardISX_Efp_Espcl7forwardIT0_Efp0_EEEOSX_DpOS1P_ 07:12:34 @ 0x7f81f8ac2a34 _ZN6lambda8internal7PartialIZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS2_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS2_6FutureIdEESt4hashISsESt8equal_toISsEEOSK_ISsSF_INS2_10StatisticsIdEEESO_SQ_ESJ_SR_SW_EENSL_IT_EERKNS2_3PIDIT0_EEMS11_FSZ_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS2_7PromiseISB_EESt14default_deleteIS1I_EEOSH_SS_SX_PNS2_11ProcessBaseEE_IS1L_SH_SR_SW_St12_PlaceholderILi113invoke_expandIS1P_St5tupleIIS1L_SH_SR_SW_S1R_EES1U_IIOS1O_EEILm0ELm1ELm2ELm3ELm4DTcl6invokecl7forwardISY_Efp_Espcl6expandcl3getIXT2_EEcl7forwardIS11_Efp0_EEcl7forwardIS15_Efp2_OSY_OS11_N5cpp1416integer_sequenceImIXspT2_OS15_ 07:12:34 @ 0x7f81f8abee6e _ZNO6lambda8internal7PartialIZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS2_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS2_6FutureIdEESt4hashISsESt8equal_toISsEEOSK_ISsSF_INS2_10StatisticsIdEEESO_SQ_ESJ_SR_SW_EENSL_IT_EERKNS2_3PIDIT0_EEMS11_FSZ_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS2_7PromiseISB_EESt14default_deleteIS1I_EEOSH_SS_SX_PNS2_11ProcessBaseEE_JS1L_SH_SR_SW_St12_PlaceholderILi1clIJS1O_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2ELm3ELm4_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOS1X_ 07:12:34 @ 0x7f81f8abca67 
_ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchISt3mapISsdSt4lessISsESaISt4pairIKSsdEEENS4_7metrics8internal14MetricsProcessERK6OptionI8DurationEO7hashmapISsNS4_6FutureIdEESt4hashISsESt8equal_toISsEEOSM_ISsSH_INS4_10StatisticsIdEEESQ_SS_ESL_ST_SY_EENSN_IT_EERKNS4_3PIDIT0_EEMS13_FS11_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS4_7PromiseISD_EESt14default_deleteIS1K_EEOSJ_SU_SZ_PNS4_11ProcessBaseEE_JS1N_SJ_ST_SY_St12_PlaceholderILi1EJS1Q_EEEDTclcl7forwardIS10_Efp_Espcl7forwardIT0_Efp0_EEEOS10_DpOS1V_ 07:12:34 @ 0x7f81f8abb625
[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496526#comment-16496526 ] Benno Evers commented on MESOS-7966: > I wasn't aware that Marathon had its own reasons for doing dynamic > reservations. Do you have any details you can share on why it does or a link > to some code? I was just basing this on the following log lines, and the fact that marathon is the only framework ever mentioned as receiving inverse offers. {noformat} I0502 15:00:57.588295 20632 master.cpp:7769] Sending 1 inverse offers to framework 487b53f1-1a44-44b5-bf9f-24790937b51a-0001 (marathon1) at scheduler-e96a9f61-720c-4c0c-9018-60224ab59031@10.65.137.102:40886 {noformat} Actually, on re-reading the allocator code, it seems that it is enough for a framework to use any resources on the host scheduled for maintenance, so the focus on reservations was probably a bit of a red herring. It shouldn't change anything about the underlying race, though. > check for maintenance on agent causes fatal error > - > > Key: MESOS-7966 > URL: https://issues.apache.org/jira/browse/MESOS-7966 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.1.0 >Reporter: Rob Johnson >Assignee: Benno Evers >Priority: Critical > Labels: mesosphere, reliability > > We interact with the maintenance API frequently to orchestrate gracefully > draining agents of tasks without impacting service availability. > Occasionally we seem to trigger a fatal error in Mesos when interacting with > the API. This happens relatively frequently, and impacts us when downstream > frameworks (marathon) react badly to leader elections. > Here is the log line that we see when the master dies: > {code} > F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: > slaves[slaveId].maintenance.isSome() > {code} > It's quite possible that we're using the maintenance API in the wrong way. 
We're > happy to provide any other logs you need - please let me know what would be > useful for debugging. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496309#comment-16496309 ] Matthew Mead-Briggs commented on MESOS-7966: This is great sleuthing! Probably of note here is that for PaaSTA we do use dynamic reservations via the API to attempt to prevent tasks getting scheduled on maintenanced hosts. I'm actually looking at a way to change how we do this but the rough idea of how we do it now is: * mark host for maintenance * reserve all the resources with a dummy role * paasta scales up affected marathon apps and kills off tasks on the affected host * after each task is killed we reserve the resources we've just freed up I wasn't aware that Marathon had its own reasons for doing dynamic reservations. Do you have any details you can share on why it does or a link to some code? > check for maintenance on agent causes fatal error > - > > Key: MESOS-7966 > URL: https://issues.apache.org/jira/browse/MESOS-7966 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.1.0 >Reporter: Rob Johnson >Assignee: Benno Evers >Priority: Critical > Labels: mesosphere, reliability > > We interact with the maintenance API frequently to orchestrate gracefully > draining agents of tasks without impacting service availability. > Occasionally we seem to trigger a fatal error in Mesos when interacting with > the API. This happens relatively frequently, and impacts us when downstream > frameworks (marathon) react badly to leader elections. > Here is the log line that we see when the master dies: > {code} > F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: > slaves[slaveId].maintenance.isSome() > {code} > It's quite possible that we're using the maintenance API in the wrong way. We're > happy to provide any other logs you need - please let me know what would be > useful for debugging. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8969) Make check fails on Ubuntu 14.04
Quan Li created MESOS-8969:
------------------------------

Summary: Make check fails on Ubuntu 14.04
Key: MESOS-8969
URL: https://issues.apache.org/jira/browse/MESOS-8969
Project: Mesos
Issue Type: Bug
Components: build
Affects Versions: 1.6.0
Environment: Ubuntu 14.04.5 LTS, gcc 4.8.4
Reporter: Quan Li

make check
{code:java}
[...]
mv -f tests/.deps/mesos_tests-slave_validation_tests.Tpo tests/.deps/mesos_tests-slave_validation_tests.Po
g++ -DPACKAGE_NAME=\"mesos\" -DPACKAGE_TARNAME=\"mesos\" -DPACKAGE_VERSION=\"1.6.0\" -DPACKAGE_STRING=\"mesos\ 1.6.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"mesos\" -DVERSION=\"1.6.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_CXX11=1 -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 -DHAVE_LIBAPR_1=1 -DHAVE_LIBCURL=1 -DMESOS_HAS_JAVA=1 -DHAVE_LIBSASL2=1 -DHAVE_OPENSSL_SSL_H=1 -DHAVE_SVN_VERSION_H=1 -DHAVE_LIBSVN_SUBR_1=1 -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1 -DHAVE_ZLIB_H=1 -DHAVE_LIBZ=1 -DHAVE_PYTHON=\"2.7\" -DMESOS_HAS_PYTHON=1 -I. -I../../src -Werror -DLIBDIR=\"/usr/local/lib\" -DPKGLIBEXECDIR=\"/usr/local/libexec/mesos\" -DPKGDATADIR=\"/usr/local/share/mesos\" -DPKGMODULEDIR=\"/usr/local/lib/mesos/modules\" -I../../include -I../include -I../include/mesos -DPICOJSON_USE_INT64 -D__STDC_FORMAT_MACROS -I../3rdparty/boost-1.65.0 -I../3rdparty/concurrentqueue-7b69a8f -I../3rdparty/elfio-3.2 -I../3rdparty/glog-0.3.3/src -I../3rdparty/leveldb-1.19/include -I../../3rdparty/libprocess/include -I../3rdparty/nvml-352.79 -I../3rdparty/picojson-1.3.0 -I../3rdparty/protobuf-3.5.0/src -I../../3rdparty/stout/include -I../3rdparty/zookeeper-3.4.8/src/c/include -I../3rdparty/zookeeper-3.4.8/src/c/generated -I../include/csi -DSOURCE_DIR=\"/home/liquan/Mesos/mesos-1.6.0/build/..\" -DBUILD_DIR=\"/home/liquan/Mesos/mesos-1.6.0/build\" -I../3rdparty/googletest-release-1.8.0/googletest/include -I../3rdparty/googletest-release-1.8.0/googlemock/include -DTESTLIBEXECDIR=\"/usr/local/libexec/mesos/tests\" -DSBINDIR=\"/usr/local/sbin\" -I/usr/lib/java/jdk1.8.0_171/include -I/usr/lib/java/jdk1.8.0_171/include/linux -DZOOKEEPER_VERSION=\"3.4.8\" -I/usr/include/subversion-1 -I/usr/include/apr-1 -I/usr/include/apr-1.0 -pthread -Wall -Wsign-compare -Wformat-security -fstack-protector -fPIC -fPIE -g1 -O0 -Wno-unused-local-typedefs -std=c++11 -MT tests/mesos_tests-slave_tests.o -MD -MP -MF tests/.deps/mesos_tests-slave_tests.Tpo -c -o tests/mesos_tests-slave_tests.o `test -f 'tests/slave_tests.cpp' || echo '../../src/'`tests/slave_tests.cpp
In file included from ../3rdparty/googletest-release-1.8.0/googletest/include/gtest/internal/gtest-param-util.h:50:0,
                 from ../3rdparty/googletest-release-1.8.0/googletest/include/gtest/gtest-param-test.h:192,
                 from ../3rdparty/googletest-release-1.8.0/googletest/include/gtest/gtest.h:62,
                 from ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/internal/gmock-internal-utils.h:47,
                 from ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock-actions.h:46,
                 from ../3rdparty/googletest-release-1.8.0/googlemock/include/gmock/gmock.h:58,
                 from ../../src/tests/slave_tests.cpp:27:
../3rdparty/googletest-release-1.8.0/googletest/include/gtest/gtest-printers.h: In function 'void testing::internal::DefaultPrintTo(testing::internal::IsNotContainer, testing::internal::false_type, const T&, std::ostream*) [with T = process::Future (mesos::internal::slave::Slave::*)(); testing::internal::IsNotContainer = char; testing::internal::false_type = testing::internal::bool_constant; std::ostream = std::basic_ostream]':
../3rdparty/googletest-release-1.8.0/googletest/include/gtest/gtest-printers.h:439:3: internal compiler error: Segmentation fault
   ::testing_internal::DefaultPrintNonContainerTo(value, os);
   ^
Please submit a full bug report,
with preprocessed source if appropriate.
See for instructions.
The bug is not reproducible, so it is likely a hardware or OS problem.
make[3]: *** [tests/mesos_tests-slave_tests.o] Error 1
make[3]: Leaving directory `/home/liquan/Mesos/mesos-1.6.0/build/src'
make[2]: *** [check-am] Error 2
make[2]: Leaving directory `/home/liquan/Mesos/mesos-1.6.0/build/src'
make[1]: *** [check] Error 2
make[1]: Leaving directory `/home/liquan/Mesos/mesos-1.6.0/build/src'
make: *** [check-recursive] Error 1
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)