[jira] [Commented] (MESOS-8727) JSON -> protobuf conversion in stout handles duplicated keys in a map incorrectly
[ https://issues.apache.org/jira/browse/MESOS-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410757#comment-16410757 ]

Qian Zhang commented on MESOS-8727:
---

The root cause of this issue is that when {{JSON::parse()}} is called with a JSON string that contains duplicated map keys, the value of the last key seen is used and the earlier ones are dropped.

> JSON -> protobuf conversion in stout handles duplicated keys in a map
> incorrectly
> -
>
> Key: MESOS-8727
> URL: https://issues.apache.org/jira/browse/MESOS-8727
> Project: Mesos
> Issue Type: Bug
> Components: stout
> Reporter: Qian Zhang
> Priority: Major
>
> In Mesos code, we usually use the following two functions in stout to convert
> a JSON string to a protobuf message.
> # {{JSON::parse()}} to convert a JSON string to a JSON object (i.e.,
> {{JSON::Object}}).
> # {{protobuf::parse()}} to convert the JSON object to a protobuf message.
> In Google protobuf, there is a single function which can be used to achieve
> the same goal: {{JsonStringToMessage()}}. And based on [the doc of Google
> protobuf|https://developers.google.com/protocol-buffers/docs/proto#maps], if
> there are duplicated keys in a map in a JSON string, the conversion to
> protobuf message may fail, i.e., if we use {{JsonStringToMessage}} to convert
> the following JSON string to a protobuf message, it will fail with an error
> like {{int32_to_string[0]: Repeated map key: '1' is already set.}}
> {code:java}
> "int32_to_string": {
>   "1": "value1",
>   "1": "value2"
> }
> {code}
> However, {{JSON::parse()}} and {{protobuf::parse()}} handle this case
> differently: they will succeed, and in the resulting protobuf message, we will
> see only one key-value pair {{"1": "value2"}}, i.e., the first key-value pair
> is overwritten. We should have the same behavior as {{JsonStringToMessage}}.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8727) JSON -> protobuf conversion in stout handles duplicated keys in a map incorrectly
Qian Zhang created MESOS-8727:
-

Summary: JSON -> protobuf conversion in stout handles duplicated keys in a map incorrectly
Key: MESOS-8727
URL: https://issues.apache.org/jira/browse/MESOS-8727
Project: Mesos
Issue Type: Bug
Components: stout
Reporter: Qian Zhang

In Mesos code, we usually use the following two functions in stout to convert a JSON string to a protobuf message:
# {{JSON::parse()}} to convert a JSON string to a JSON object (i.e., {{JSON::Object}}).
# {{protobuf::parse()}} to convert the JSON object to a protobuf message.

In Google protobuf, a single function achieves the same goal: {{JsonStringToMessage()}}. According to [the doc of Google protobuf|https://developers.google.com/protocol-buffers/docs/proto#maps], if a map in a JSON string contains duplicated keys, the conversion to a protobuf message may fail. For example, using {{JsonStringToMessage()}} to convert the following JSON string to a protobuf message fails with an error like {{int32_to_string[0]: Repeated map key: '1' is already set.}}
{code:java}
"int32_to_string": {
  "1": "value1",
  "1": "value2"
}
{code}
However, {{JSON::parse()}} and {{protobuf::parse()}} handle this case differently: they succeed, and the resulting protobuf message contains only one key-value pair, {{"1": "value2"}}, i.e., the first key-value pair is overwritten. We should have the same behavior as {{JsonStringToMessage()}}.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
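To make the difference between the two behaviors concrete, here is a minimal C++ sketch. It does not use stout or protobuf; {{Entries}}, {{buildLastWins()}} and {{buildStrict()}} are hypothetical names invented for illustration, and the key-value pairs are assumed to be already extracted from the JSON text in the order they appear. The error string only imitates the one quoted in the description.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Key-value pairs in the order they appear in the JSON object.
typedef std::vector<std::pair<std::string, std::string> > Entries;

// "Last key wins": a later duplicate silently overwrites the earlier value,
// which is the behavior the ticket describes for the stout code path.
std::map<std::string, std::string> buildLastWins(const Entries& entries)
{
  std::map<std::string, std::string> result;
  for (const auto& entry : entries) {
    result[entry.first] = entry.second;
  }
  return result;
}

// Strict variant: a duplicate key is reported as an error instead,
// comparable to what the ticket reports for JsonStringToMessage().
bool buildStrict(
    const Entries& entries,
    std::map<std::string, std::string>* result,
    std::string* error)
{
  for (const auto& entry : entries) {
    // emplace() leaves the map unchanged and returns false in `.second`
    // if the key is already present.
    if (!result->emplace(entry.first, entry.second).second) {
      *error = "Repeated map key: '" + entry.first + "' is already set.";
      return false;
    }
  }
  return true;
}
```

With the entries {{("1", "value1")}} and {{("1", "value2")}}, the first function keeps only {{"value2"}}, while the second one stops at the duplicate and returns an error.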
[jira] [Comment Edited] (MESOS-8530) Default executor tasks can get stuck in KILLING state
[ https://issues.apache.org/jira/browse/MESOS-8530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368048#comment-16368048 ] Gastón Kleiman edited comment on MESOS-8530 at 3/23/18 2:18 AM: https://reviews.apache.org/r/65692/ https://reviews.apache.org/r/65693/ https://reviews.apache.org/r/66232/ https://reviews.apache.org/r/65694/ https://reviews.apache.org/r/66233/ https://reviews.apache.org/r/65962/ https://reviews.apache.org/r/66234/ was (Author: gkleiman): https://reviews.apache.org/r/65692/ https://reviews.apache.org/r/65693/ https://reviews.apache.org/r/65694/ https://reviews.apache.org/r/65695/ https://reviews.apache.org/r/66123/ > Default executor tasks can get stuck in KILLING state > - > > Key: MESOS-8530 > URL: https://issues.apache.org/jira/browse/MESOS-8530 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.2.3, 1.3.1, 1.4.1, 1.5.0 >Reporter: Gastón Kleiman >Assignee: Gastón Kleiman >Priority: Critical > Labels: default-executor, mesosphere > > The default executor will transition a task to {{TASK_KILLING}} and mark its > container as being killed before issuing the {{KILL_NESTED_CONTAINER}} call. > If the kill call fails, the task will get stuck in {{TASK_KILLING}}, and the > executor won't allow retrying the kill. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8726) Default executor doesn't retry SIGTERM kills
Gastón Kleiman created MESOS-8726: - Summary: Default executor doesn't retry SIGTERM kills Key: MESOS-8726 URL: https://issues.apache.org/jira/browse/MESOS-8726 Project: Mesos Issue Type: Bug Components: executor Reporter: Gastón Kleiman Once https://issues.apache.org/jira/browse/MESOS-8530 is resolved, the default executor will retry the kill escalation (SIGKILL), but not the initial SIGTERM. Tasks won't get stuck anymore, but this is still bad, because it could prevent tasks from gracefully shutting down. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8725) Support deadline for tasks
[ https://issues.apache.org/jira/browse/MESOS-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410613#comment-16410613 ]

Zhitao Li commented on MESOS-8725:
--

[~jamesmulcahy], we actually started on that path, but ran into several scalability difficulties:
* limited compute resources on the scheduler: a lot of schedulers follow the same design as the Mesos master and run only one active process, so tracking a timer per task uses up precious resources there;
* network partition: if the master/agent is under a network partition, the scheduler cannot terminate the task;
* recovery upon scheduler restart: this was the biggest problem for us. When our scheduler process restarted, it needed to recover "all" running tasks from the database and reconstruct what to do for each task (which is also a common pattern among schedulers). Any additional feature introduced there makes that process heavier;
* cheaper to implement in the executor: with isolation mechanisms like `pid`, we expect the executor to have a longer lifecycle. Therefore, executors do not even need to maintain a busy thread, but can simply use a [Timer|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/timer.hpp] and terminate the task.

> Support deadline for tasks
> --
>
> Key: MESOS-8725
> URL: https://issues.apache.org/jira/browse/MESOS-8725
> Project: Mesos
> Issue Type: Improvement
> Reporter: Zhitao Li
> Priority: Major
>
> In our environment, we run a lot of batch jobs, some of which have tight
> timeline. If any tasks in the job runs longer than x hours, it does not make
> sense to run it anymore.
>
> For instance, a team would submit a job which builds a weekly index and
> repeats every Monday. If the job does not finish before next Monday for
> whatever reason, there is no point to keep any task running.
> > We believe that implementing deadline tracking distributed across our cluster > makes more sense as it makes the system more scalable and also makes our > centralized state machine simpler. > > One idea I have right now is to add an *optional* *TimeInfo deadline* to > TaskInfo field, and all default executors in Mesos can simply terminate the > task and send a proper *StatusUpdate.* -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8725) Support deadline for tasks
[ https://issues.apache.org/jira/browse/MESOS-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410505#comment-16410505 ] James Mulcahy commented on MESOS-8725: -- Is this actually simpler overall? The framework will know the deadline for the task itself, and could kill the task if that expired, without any changes in Mesos today. I could see an argument for decentralizing this to the agents if this was an "expensive" thing to check, but it seems like a relatively low overhead + low complexity task for a framework to track – even with say, millions of tasks? > Support deadline for tasks > -- > > Key: MESOS-8725 > URL: https://issues.apache.org/jira/browse/MESOS-8725 > Project: Mesos > Issue Type: Improvement >Reporter: Zhitao Li >Priority: Major > > In our environment, we run a lot of batch jobs, some of which have tight > timeline. If any tasks in the job runs longer than x hours, it does not make > sense to run it anymore. > > For instance, a team would submit a job which builds a weekly index and > repeats every Monday. If the job does not finish before next Monday for > whatever reason, there is no point to keep any task running. > > We believe that implementing deadline tracking distributed across our cluster > makes more sense as it makes the system more scalable and also makes our > centralized state machine simpler. > > One idea I have right now is to add an *optional* *TimeInfo deadline* to > TaskInfo field, and all default executors in Mesos can simply terminate the > task and send a proper *StatusUpdate.* -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8714) Cleanup `containers_` hashmap once container exits
[ https://issues.apache.org/jira/browse/MESOS-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410015#comment-16410015 ] Andrei Budnik commented on MESOS-8714: -- Composing c'zer [subscribes|https://github.com/apache/mesos/blob/5b655ce062ff55cdefed119d97ad923aeeb2efb5/src/slave/containerizer/composing.cpp#L356-L357] on container termination after successful launch. So we always clean up this hash map. After changes in composing c'zer, this invariant (that we always clean up terminated containers) should remain unchanged. I think that there should be only one place, where we do cleanup: `ComposingContainerizerProcess::_launch`. > Cleanup `containers_` hashmap once container exits > -- > > Key: MESOS-8714 > URL: https://issues.apache.org/jira/browse/MESOS-8714 > Project: Mesos > Issue Type: Task >Reporter: Andrei Budnik >Priority: Major > > To clean up a `containers_` hash map in composing c'zer, we need to subscribe > on a container termination event in `_launch` method. Also, it's desirable to > limit the number of places where we do the clean up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8724) G++ Warning about libc system macros `major` and `minor` prevents Mesos build
[ https://issues.apache.org/jira/browse/MESOS-8724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409916#comment-16409916 ]

Benno Evers commented on MESOS-8724:

One subtle thing to keep in mind: if we decide to "properly" fix it by getting protoc to add the correct #undef's for minor and major, we should take care to *not* backport the patch to older Mesos versions, since that would remove the previously defined function `csi::Version::gnu_dev_major()`, causing ABI incompatibility for people upgrading libmesos.so.

> G++ Warning about libc system macros `major` and `minor` prevents Mesos build
> -
>
> Key: MESOS-8724
> URL: https://issues.apache.org/jira/browse/MESOS-8724
> Project: Mesos
> Issue Type: Bug
> Reporter: Benno Evers
> Priority: Major
>
> On linux systems, the header `<sys/sysmacros.h>` defines three macros called
> makedev(), major() and minor(). (See also
> [http://man7.org/linux/man-pages/man3/makedev.3.html])
> Trying to compile Mesos using g++ 7.2.0 leads to the following warning:
> {noformat}
> ../include/csi/csi.pb.h:6042:13: error: In the GNU C Library, "minor" is defined
>  by <sys/sysmacros.h>. For historical compatibility, it is
>  currently defined by <sys/types.h> as well, but we plan to
>  remove this soon. To use "minor", include <sys/sysmacros.h>
>  directly. If you did not intend to use a system-defined macro
>  "minor", you should undefine it after including <sys/types.h>. [-Werror]
>  inline ::google::protobuf::uint32 Version::minor() const {
> {noformat}
> The root cause is that csi.proto defines the following protobuf message:
> {noformat}
> message Version {
>   uint32 major = 1; // This field is REQUIRED.
>   uint32 minor = 2; // This field is REQUIRED.
>   uint32 patch = 3; // This field is REQUIRED.
> }
> {noformat}
> The generated C++ in `csi.pb.h` headers will contain, amongst others, the
> following function:
> {noformat}
> #include <string>
> // [6000 lines of code...]
> inline ::google::protobuf::uint32 Version::major() const {
>   // @@protoc_insertion_point(field_get:csi.Version.major)
>   return major_;
> }
> {noformat}
> And the recursive include structure of the header `<string>` leads to
> `stdlib.h` as follows:
> {noformat}
> . /usr/include/c++/7/string
> .. /usr/include/c++/7/bits/basic_string.h
> ... /usr/include/c++/7/ext/string_conversions.h
> .... /usr/include/c++/7/cstdlib
> ..... /usr/include/stdlib.h
> ...... /usr/include/x86_64-linux-gnu/sys/types.h
> ....... /usr/include/x86_64-linux-gnu/sys/sysmacros.h{noformat}
>

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8725) Support deadline for tasks
Zhitao Li created MESOS-8725:

Summary: Support deadline for tasks
Key: MESOS-8725
URL: https://issues.apache.org/jira/browse/MESOS-8725
Project: Mesos
Issue Type: Improvement
Reporter: Zhitao Li

In our environment, we run a lot of batch jobs, some of which have tight timelines. If any task in a job runs longer than x hours, it does not make sense to keep running it.

For instance, a team would submit a job which builds a weekly index and repeats every Monday. If the job does not finish before the next Monday for whatever reason, there is no point in keeping any of its tasks running.

We believe that distributing deadline tracking across our cluster makes more sense, as it makes the system more scalable and also makes our centralized state machine simpler.

One idea I have right now is to add an *optional* *TimeInfo deadline* field to TaskInfo, so that all default executors in Mesos can simply terminate the task once the deadline passes and send a proper *StatusUpdate.*

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
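As a rough illustration of what executor-side deadline tracking would have to compute, here is a small C++ sketch. It is not Mesos code: {{Clock}}, {{remainingUntilDeadline()}} and {{deadlineExpired()}} are hypothetical names, and the deadline is assumed to be an absolute wall-clock time, which the proposal above does not specify.

```cpp
#include <chrono>

// Assumption: the deadline is an absolute point in wall-clock time.
typedef std::chrono::system_clock Clock;

// How long an executor would arm its timer for. Returns zero if the
// deadline has already passed, e.g. when a task is launched late.
std::chrono::seconds remainingUntilDeadline(
    const Clock::time_point& deadline,
    const Clock::time_point& now)
{
  if (now >= deadline) {
    return std::chrono::seconds(0);
  }
  return std::chrono::duration_cast<std::chrono::seconds>(deadline - now);
}

// True once the task should be terminated and a status update sent.
bool deadlineExpired(
    const Clock::time_point& deadline,
    const Clock::time_point& now)
{
  return now >= deadline;
}
```

An executor could arm a single timer for the returned duration per task, so no polling thread is needed.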
[jira] [Commented] (MESOS-8714) Cleanup `containers_` hashmap once container exits
[ https://issues.apache.org/jira/browse/MESOS-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409871#comment-16409871 ] Greg Mann commented on MESOS-8714: -- So it looks like we currently only remove container IDs from the {{containers_}} map when {{destroy()}} is called on a container, but not for other cases of container termination? > Cleanup `containers_` hashmap once container exits > -- > > Key: MESOS-8714 > URL: https://issues.apache.org/jira/browse/MESOS-8714 > Project: Mesos > Issue Type: Task >Reporter: Andrei Budnik >Priority: Major > > To clean up a `containers_` hash map in composing c'zer, we need to subscribe > on a container termination event in `_launch` method. Also, it's desirable to > limit the number of places where we do the clean up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8724) G++ Warning about libc system macros `major` and `minor` prevents Mesos build
Benno Evers created MESOS-8724:
--

Summary: G++ Warning about libc system macros `major` and `minor` prevents Mesos build
Key: MESOS-8724
URL: https://issues.apache.org/jira/browse/MESOS-8724
Project: Mesos
Issue Type: Bug
Reporter: Benno Evers

On linux systems, the header `<sys/sysmacros.h>` defines three macros called makedev(), major() and minor(). (See also http://man7.org/linux/man-pages/man3/makedev.3.html)

Trying to compile Mesos using g++ 7.2.0 leads to the following warning:
{noformat}
../include/csi/csi.pb.h:6042:13: error: In the GNU C Library, "minor" is defined
 by <sys/sysmacros.h>. For historical compatibility, it is
 currently defined by <sys/types.h> as well, but we plan to
 remove this soon. To use "minor", include <sys/sysmacros.h>
 directly. If you did not intend to use a system-defined macro
 "minor", you should undefine it after including <sys/types.h>. [-Werror]
 inline ::google::protobuf::uint32 Version::minor() const {
{noformat}
The root cause is that csi.proto defines the following protobuf message:
{noformat}
message Version {
  uint32 major = 1; // This field is REQUIRED.
  uint32 minor = 2; // This field is REQUIRED.
  uint32 patch = 3; // This field is REQUIRED.
}
{noformat}
The generated C++ in `csi.pb.h` headers will contain, amongst others, the following function:
{noformat}
#include <string>
// [6000 lines of code...]
inline ::google::protobuf::uint32 Version::major() const {
  // @@protoc_insertion_point(field_get:csi.Version.major)
  return major_;
}
{noformat}
And the recursive include structure of the header `<string>` leads to `stdlib.h` as follows:
{noformat}
. /usr/include/c++/7/string
.. /usr/include/c++/7/bits/basic_string.h
... /usr/include/c++/7/ext/string_conversions.h
.... /usr/include/c++/7/cstdlib
..... /usr/include/stdlib.h{noformat}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
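The warning text itself suggests undefining the macros after the system header has been included. The following C++ sketch shows that general workaround on a hand-written class; the {{Version}} class here is a hypothetical stand-in for the protoc-generated one, and this is not necessarily the fix that was applied in Mesos.

```cpp
#include <cstdint>
#include <sys/types.h>  // On affected glibc versions this pulls in the macros.

// Remove the function-like macros so that `major` and `minor` can be used
// as ordinary identifiers below. The guards keep this harmless on systems
// where the macros are not defined.
#ifdef major
#undef major
#endif

#ifdef minor
#undef minor
#endif

// Hypothetical stand-in for the generated `csi::Version` accessors.
class Version
{
public:
  Version(uint32_t major, uint32_t minor) : major_(major), minor_(minor) {}

  // Without the #undef above, `major()` would be expanded by the
  // preprocessor before the compiler ever sees a member function.
  uint32_t major() const { return major_; }
  uint32_t minor() const { return minor_; }

private:
  uint32_t major_;
  uint32_t minor_;
};
```

Code that still needs the device-number helpers after such an #undef can call the underlying functions (e.g. `gnu_dev_major()`) directly.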
[jira] [Created] (MESOS-8723) ROOT_HealthCheckUsingPersistentVolume is flaky.
Alexander Rukletsov created MESOS-8723: -- Summary: ROOT_HealthCheckUsingPersistentVolume is flaky. Key: MESOS-8723 URL: https://issues.apache.org/jira/browse/MESOS-8723 Project: Mesos Issue Type: Bug Components: test Affects Versions: 1.5.0 Environment: ec2's CentOS 7 with SSL Reporter: Alexander Rukletsov Attachments: ROOT_HealthCheckUsingPersistentVolume-badrun.txt {noformat} ../../src/tests/cluster.cpp:660: Failure Failed to wait 15secs for destroy I0321 19:45:11.676262 8064 master.cpp:1137] Master terminating I0321 19:45:11.676625 27242 hierarchical.cpp:609] Removed agent b7675b9a-d9e9-4c97-a5c2-d50fc6101301-S0 {noformat} Full log attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8550) Bug in `Master::detected()` leads to coredump in `MasterZooKeeperTest.MasterInfoAddress`.
[ https://issues.apache.org/jira/browse/MESOS-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409660#comment-16409660 ]

Alexander Rukletsov commented on MESOS-8550:

Backport to 1.4.x:
{noformat}
commit 986894193810e271f4e15db9743bb9e1f6a24b01
Author: Benno Evers
AuthorDate: Thu Mar 22 15:10:30 2018 +0100
Commit: Alexander Rukletsov
CommitDate: Thu Mar 22 15:49:06 2018 +0100

    Handled 'None' passed from the MasterDetector in 'Master::detect()'.

    The function `MasterDetector::detect()` returns a value of type
    `Future<Option<MasterInfo>>`, which, according to its documentation,
    can be `None` if an election occurred and no master is elected.

    However, the code in `Master::detected()` was only handling the
    cases of a failed future or a valid `MasterInfo` object.

    *NOTE*: This commit does not add a corresponding unit test, since
    that would require starting a non-leading master. For the
    ZooKeeperMasterDetector, this is blocked by MESOS-2976, and an API
    change to make this possible with the StandaloneMasterDetector would
    add a lot of complexity to the `cluster::Master::start()` function
    for a feature that is unlikely to be re-used in any other test.

    Review: https://reviews.apache.org/r/65571/
    (cherry picked from commit 972f31752dd99a59903370b9ebcf078501fa8ffc)
{noformat}

> Bug in `Master::detected()` leads to coredump in
> `MasterZooKeeperTest.MasterInfoAddress`.
> -
>
> Key: MESOS-8550
> URL: https://issues.apache.org/jira/browse/MESOS-8550
> Project: Mesos
> Issue Type: Bug
> Components: leader election, master
> Affects Versions: 1.5.0
> Reporter: Andrei Budnik
> Assignee: Benno Evers
> Priority: Major
> Labels: mesosphere
> Fix For: 1.4.2, 1.6.0, 1.5.1
>
> Attachments: MasterZooKeeperTest.MasterInfoAddress-badrun.txt
>
>
> {code:java}
> 15:55:17 Assertion failed: (isSome()), function get, file
> ../../3rdparty/stout/include/stout/option.hpp, line 119.
> 15:55:17 *** Aborted at 1518018924 (unix time) try "date -d @1518018924" if > you are using GNU date *** > 15:55:17 PC: @ 0x7fff4f8f2e3e __pthread_kill > 15:55:17 *** SIGABRT (@0x7fff4f8f2e3e) received by PID 39896 (TID > 0x70427000) stack trace: *** > 15:55:17 @ 0x7fff4fa24f5a _sigtramp > 15:55:17 I0207 07:55:24.945252 4890624 group.cpp:511] ZooKeeper session > expired > 15:55:17 @ 0x70425500 (unknown) > 15:55:17 2018-02-07 07:55:24,945:39896(0x70633000):ZOO_INFO@log_env@794: > Client > environment:user.dir=/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/1mHCvU > 15:55:17 @ 0x7fff4f84f312 abort > 15:55:17 2018-02-07 > 07:55:24,945:39896(0x70633000):ZOO_INFO@zookeeper_init@827: Initiating > client connection, host=127.0.0.1:52197 sessionTimeout=1 > watcher=0x10d916590 sessionId=0 sessionPasswd= context=0x7fe1bda706a0 > flags=0 > 15:55:17 @ 0x7fff4f817368 __assert_rtn > 15:55:17 @0x10b9cff97 _ZNR6OptionIN5mesos10MasterInfoEE3getEv > 15:55:17 @0x10bbb04b5 Option<>::operator->() > 15:55:17 @0x10bd4514a mesos::internal::master::Master::detected() > 15:55:17 @0x10bf54558 > _ZZN7process8dispatchIN5mesos8internal6master6MasterERKNS_6FutureI6OptionINS1_10MasterInfoSB_EEvRKNS_3PIDIT_EEMSD_FvT0_EOT1_ENKUlOS9_PNS_11ProcessBaseEE_clESM_SO_ > 15:55:17 @0x10bf54310 > _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINS3_10MasterInfoSD_EEvRKNS1_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS1_11ProcessBaseEE_JSB_SQ_EEEDTclclsr3stdE7forwardISF_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSF_DpOSS_ > 15:55:17 @0x10bf542bb > 
_ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1E13invoke_expandISS_NST_5tupleIJSC_SW_EEENSZ_IJOSR_EEEJLm0ELm1DTclsr5cpp17E6invokeclsr3stdE7forwardISG_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardISK_Efp0_EEclsr3stdE7forwardISN_Efp2_OSG_OSK_N5cpp1416integer_sequenceImJXspT2_SO_ > 15:55:17 @0x10bf541f3 > _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1EclIJSR_EEEDTcl13invoke_expandclL_ZNST_4moveIRSS_EEONST_16remove_referenceISG_E4typeEOSG_EdtdefpT1fEclL_ZNSZ_IRNST_5tupleIJSC_SW_ES14_S15_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOS1C_ > 15:55:17
[jira] [Commented] (MESOS-8550) Bug in `Master::detected()` leads to coredump in `MasterZooKeeperTest.MasterInfoAddress`
[ https://issues.apache.org/jira/browse/MESOS-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409630#comment-16409630 ]

Alexander Rukletsov commented on MESOS-8550:

Backport to 1.5.x:
{noformat}
commit 9281f922d7ec527763b3f88793b6821337f9c665
Author: Benno Evers
AuthorDate: Thu Mar 22 15:10:30 2018 +0100
Commit: Alexander Rukletsov
CommitDate: Thu Mar 22 15:30:27 2018 +0100

    Handled 'None' passed from the MasterDetector in 'Master::detect()'.

    The function `MasterDetector::detect()` returns a value of type
    `Future<Option<MasterInfo>>`, which, according to its documentation,
    can be `None` if an election occurred and no master is elected.

    However, the code in `Master::detected()` was only handling the
    cases of a failed future or a valid `MasterInfo` object.

    *NOTE*: This commit does not add a corresponding unit test, since
    that would require starting a non-leading master. For the
    ZooKeeperMasterDetector, this is blocked by MESOS-2976, and an API
    change to make this possible with the StandaloneMasterDetector would
    add a lot of complexity to the `cluster::Master::start()` function
    for a feature that is unlikely to be re-used in any other test.

    Review: https://reviews.apache.org/r/65571/
    (cherry picked from commit 972f31752dd99a59903370b9ebcf078501fa8ffc)
{noformat}

> Bug in `Master::detected()` leads to coredump in
> `MasterZooKeeperTest.MasterInfoAddress`
> 
>
> Key: MESOS-8550
> URL: https://issues.apache.org/jira/browse/MESOS-8550
> Project: Mesos
> Issue Type: Bug
> Components: leader election, master
> Affects Versions: 1.5.0
> Reporter: Andrei Budnik
> Assignee: Benno Evers
> Priority: Major
> Labels: mesosphere
> Fix For: 1.6.0, 1.5.1
>
> Attachments: MasterZooKeeperTest.MasterInfoAddress-badrun.txt
>
>
> {code:java}
> 15:55:17 Assertion failed: (isSome()), function get, file
> ../../3rdparty/stout/include/stout/option.hpp, line 119.
> 15:55:17 *** Aborted at 1518018924 (unix time) try "date -d @1518018924" if > you are using GNU date *** > 15:55:17 PC: @ 0x7fff4f8f2e3e __pthread_kill > 15:55:17 *** SIGABRT (@0x7fff4f8f2e3e) received by PID 39896 (TID > 0x70427000) stack trace: *** > 15:55:17 @ 0x7fff4fa24f5a _sigtramp > 15:55:17 I0207 07:55:24.945252 4890624 group.cpp:511] ZooKeeper session > expired > 15:55:17 @ 0x70425500 (unknown) > 15:55:17 2018-02-07 07:55:24,945:39896(0x70633000):ZOO_INFO@log_env@794: > Client > environment:user.dir=/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/1mHCvU > 15:55:17 @ 0x7fff4f84f312 abort > 15:55:17 2018-02-07 > 07:55:24,945:39896(0x70633000):ZOO_INFO@zookeeper_init@827: Initiating > client connection, host=127.0.0.1:52197 sessionTimeout=1 > watcher=0x10d916590 sessionId=0 sessionPasswd= context=0x7fe1bda706a0 > flags=0 > 15:55:17 @ 0x7fff4f817368 __assert_rtn > 15:55:17 @0x10b9cff97 _ZNR6OptionIN5mesos10MasterInfoEE3getEv > 15:55:17 @0x10bbb04b5 Option<>::operator->() > 15:55:17 @0x10bd4514a mesos::internal::master::Master::detected() > 15:55:17 @0x10bf54558 > _ZZN7process8dispatchIN5mesos8internal6master6MasterERKNS_6FutureI6OptionINS1_10MasterInfoSB_EEvRKNS_3PIDIT_EEMSD_FvT0_EOT1_ENKUlOS9_PNS_11ProcessBaseEE_clESM_SO_ > 15:55:17 @0x10bf54310 > _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINS3_10MasterInfoSD_EEvRKNS1_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS1_11ProcessBaseEE_JSB_SQ_EEEDTclclsr3stdE7forwardISF_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSF_DpOSS_ > 15:55:17 @0x10bf542bb > 
_ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1E13invoke_expandISS_NST_5tupleIJSC_SW_EEENSZ_IJOSR_EEEJLm0ELm1DTclsr5cpp17E6invokeclsr3stdE7forwardISG_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardISK_Efp0_EEclsr3stdE7forwardISN_Efp2_OSG_OSK_N5cpp1416integer_sequenceImJXspT2_SO_ > 15:55:17 @0x10bf541f3 > _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1EclIJSR_EEEDTcl13invoke_expandclL_ZNST_4moveIRSS_EEONST_16remove_referenceISG_E4typeEOSG_EdtdefpT1fEclL_ZNSZ_IRNST_5tupleIJSC_SW_ES14_S15_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOS1C_ > 15:55:17 @
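The assertion at the top of the trace, `(isSome())` in `get`, fires when an empty `Option` is dereferenced. The C++ sketch below illustrates the kind of check the commit message describes. The `Option` template is a simplified stand-in written for this example, not stout's implementation, and `MasterInfo`, its `address` field and `describeLeader()` are hypothetical. The failed-future case is left out.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for an option type: either holds a value or is "None".
template <typename T>
class Option
{
public:
  Option() : some_(false), value_() {}                       // None.
  Option(const T& value) : some_(true), value_(value) {}     // Some(value).

  bool isSome() const { return some_; }
  bool isNone() const { return !some_; }

  // Mirrors the failing assertion in the trace: get() on None aborts.
  const T& get() const
  {
    assert(isSome());
    return value_;
  }

private:
  bool some_;
  T value_;
};

// Hypothetical, heavily reduced master description.
struct MasterInfo
{
  std::string address;
};

// Handles the "election happened, but no master was elected" case
// explicitly before touching the contained value.
std::string describeLeader(const Option<MasterInfo>& leader)
{
  if (leader.isNone()) {
    return "no master elected";
  }
  return "leader at " + leader.get().address;
}
```

Checking `isNone()` first means `get()` is only reached when a value is present, so the assertion cannot fire on this path.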
[jira] [Created] (MESOS-8722) Hard-coded timeout for authentication failures
Benno Evers created MESOS-8722:
--

Summary: Hard-coded timeout for authentication failures
Key: MESOS-8722
URL: https://issues.apache.org/jira/browse/MESOS-8722
Project: Mesos
Issue Type: Bug
Reporter: Benno Evers

In the Mesos agent there is a hard-coded 5 second timeout for any authentication attempt:
{noformat}
void Slave::authenticate()
{
  [...]

  delay(Seconds(5), self(), &Self::authenticationTimeout, authenticating.get());
}
{noformat}
When the network is poor, this can lead to a situation where an agent does not manage to authenticate for a long time, preventing it from re-joining the cluster.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
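One conceivable direction, shown here only as a hedged sketch, is to derive the timeout from configurable bounds and let it grow after failed attempts. The function name and parameters below are hypothetical, nothing in the ticket says this is how the agent should or does behave, and the doubling policy is just one possible choice.

```cpp
#include <algorithm>
#include <chrono>

// Computes the timeout for the next authentication attempt: starts at
// `base`, doubles for every previously failed attempt, and never exceeds
// `max`. All three inputs are assumptions made for this example.
std::chrono::seconds authenticationTimeout(
    std::chrono::seconds base,
    std::chrono::seconds max,
    unsigned failedAttempts)
{
  std::chrono::seconds timeout = base;

  // Stop doubling as soon as the cap is reached to avoid overflow.
  for (unsigned i = 0; i < failedAttempts && timeout < max; ++i) {
    timeout *= 2;
  }

  return std::min(timeout, max);
}
```

With a base of 5 seconds and a cap of 60 seconds, the first attempt keeps the current 5 second behavior, while an agent on a slow network gets progressively more time on later attempts.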
[jira] [Created] (MESOS-8721) Unnecessary cropping of agent id's in the web ui
Benno Evers created MESOS-8721: -- Summary: Unnecessary cropping of agent id's in the web ui Key: MESOS-8721 URL: https://issues.apache.org/jira/browse/MESOS-8721 Project: Mesos Issue Type: Bug Reporter: Benno Evers Attachments: cropped_ids.png As seen in the attached image (captured from Firefox 59 and Mesos 1.2.3), the agents page of the web ui appears to be cropping agent ids even if the column would have enough space to display the full name. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8720) CSIClientTest segfaults on macOS.
Jan Schlicht created MESOS-8720: --- Summary: CSIClientTest segfaults on macOS. Key: MESOS-8720 URL: https://issues.apache.org/jira/browse/MESOS-8720 Project: Mesos Issue Type: Bug Components: storage Affects Versions: 1.6.0 Environment: macOS 10.13.3, LLVM 6.0.0 Reporter: Jan Schlicht This seems to be caused by the changes introduced in commit {{79c21981803dafd8a5e971b98961487a69017ce9}}. On a macOS build, configured with {{--enable-grpc}}, all test cases in {{CSIClientTest}} segfault. Running {{src/mesos-tests --gtest_filter=\*CSIClientTest\*}} results in {noformat} [ RUN ] Identity/CSIClientTest.Call/Client_GetSupportedVersions mesos-tests(57309,0x7fffa0293340) malloc: *** error for object 0x10bb63b68: pointer being freed was not allocated *** set a breakpoint in malloc_error_break to debug *** Aborted at 1521711802 (unix time) try "date -d @1521711802" if you are using GNU date *** PC: @ 0x7fff6738ce3e __pthread_kill *** SIGABRT (@0x7fff6738ce3e) received by PID 57309 (TID 0x7fffa0293340) stack trace: *** @ 0x7fff674bef5a _sigtramp @0x0 (unknown) @ 0x7fff672e9312 abort @ 0x7fff673e6866 free @0x10aec51bd grpc::CompletionQueue::CompletionQueue() @0x10b2087a4 process::grpc::client::Runtime::Data::Data() @0x107bd697d mesos::internal::tests::CSIClientTest::CSIClientTest() @0x107bd68ca testing::internal::ParameterizedTestFactory<>::CreateTest() @0x107c58158 testing::internal::HandleExceptionsInMethodIfSupported<>() @0x107c57fd8 testing::TestInfo::Run() @0x107c588c7 testing::TestCase::Run() @0x107c612b7 testing::internal::UnitTestImpl::RunAllTests() @0x107c60d58 testing::internal::HandleExceptionsInMethodIfSupported<>() @0x107c60cc8 testing::UnitTest::Run() @0x106afc83d main @ 0x7fff6723d115 start @0x2 (unknown) Abort trap: 6 {noformat} Increasing GLog verbosity doesn't provide more information. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8718) Add the fields `ExposedPorts` and `Volumes` into Docker v1 image spec
[ https://issues.apache.org/jira/browse/MESOS-8718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409303#comment-16409303 ] Qian Zhang commented on MESOS-8718: --- RR: https://reviews.apache.org/r/66211/ > Add the fields `ExposedPorts` and `Volumes` into Docker v1 image spec > - > > Key: MESOS-8718 > URL: https://issues.apache.org/jira/browse/MESOS-8718 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > > This ticket is to address the TODO below in the > [docker/v1.proto|https://github.com/apache/mesos/blob/1.5.0/include/mesos/docker/v1.proto#L70:L71]: > {code:java} > // TODO(gilbert): Create a message including string-message > // pair to match ExposedPorts' map (map[nat.Port]struct{}). > {code} > And similar to the field `ExposedPorts` mentioned in the above TODO, we > should also add the field `Volumes` which is also a string-message pair. > Once these two fields are added, we could consider to build features on top > of them in the `docker/runtime` isolator. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8719) Mesos compiled with `--enable-grpc` doesn't compile on non-Linux builds
Jan Schlicht created MESOS-8719: --- Summary: Mesos compiled with `--enable-grpc` doesn't compile on non-Linux builds Key: MESOS-8719 URL: https://issues.apache.org/jira/browse/MESOS-8719 Project: Mesos Issue Type: Bug Components: storage Affects Versions: 1.6.0 Environment: macOS Reporter: Jan Schlicht Assignee: Jan Schlicht Commit {{59cca968e04dee069e0df2663733b6d6f55af0da}} added {{examples/test_csi_plugin.cpp}} to non-Linux builds that are configured using the {{--enable-grpc}} flag. As {{examples/test_csi_plugin.cpp}} includes {{fs/linux.hpp}}, it can only compile on Linux and needs to be disabled for non-Linux builds. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8718) Add the fields `ExposedPorts` and `Volumes` into Docker v1 image spec
Qian Zhang created MESOS-8718: - Summary: Add the fields `ExposedPorts` and `Volumes` into Docker v1 image spec Key: MESOS-8718 URL: https://issues.apache.org/jira/browse/MESOS-8718 Project: Mesos Issue Type: Improvement Components: containerization Reporter: Qian Zhang Assignee: Qian Zhang This ticket is to address the TODO below in the [docker/v1.proto|https://github.com/apache/mesos/blob/1.5.0/include/mesos/docker/v1.proto#L70:L71]: {code:java} // TODO(gilbert): Create a message including string-message // pair to match ExposedPorts' map (map[nat.Port]struct{}). {code} And similar to the field `ExposedPorts` mentioned in the above TODO, we should also add the field `Volumes` which is also a string-message pair. Once these two fields are added, we could consider to build features on top of them in the `docker/runtime` isolator. -- This message was sent by Atlassian JIRA (v7.6.3#76005)