Re: Review Request 39060: Create master detector per url & not per framework.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39060/ --- (Updated Jan. 5, 2016, 9:01 a.m.) Review request for mesos and Joris Van Remoortere. Summary (updated) - Create master detector per url & not per framework. Bugs: MESOS-3595 https://issues.apache.org/jira/browse/MESOS-3595 Repository: mesos Description --- If the number of framework created exceeds the lib process threads then during master failover the zookeeper updates can cause deadlock. Diffs - include/mesos/scheduler.hpp 049c041286f3167e79cc5ea8a9e0bf8d42569832 src/sched/sched.cpp 44eb4f50e8ed84297268d94a3a0320c843ff6d8c src/tests/mesos.hpp a4811b0d8dee33ff2ca4968f532ce64b7ea95249 Diff: https://reviews.apache.org/r/39060/diff/ Testing --- make check successful Created 100 framework instances on a 24 CPU machine. Master failover detected by the framework process and continue to work as expected. Thanks, Mandeep Chadha
Re: Review Request 39060: Create master detector per url & not per framework
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39060/ --- (Updated Jan. 5, 2016, 8:48 a.m.) Review request for mesos and Joris Van Remoortere. Bugs: MESOS-3595 https://issues.apache.org/jira/browse/MESOS-3595 Repository: mesos Description --- If the number of framework created exceeds the lib process threads then during master failover the zookeeper updates can cause deadlock. Diffs (updated) - include/mesos/scheduler.hpp 049c041286f3167e79cc5ea8a9e0bf8d42569832 src/sched/sched.cpp 44eb4f50e8ed84297268d94a3a0320c843ff6d8c src/tests/mesos.hpp a4811b0d8dee33ff2ca4968f532ce64b7ea95249 Diff: https://reviews.apache.org/r/39060/diff/ Testing --- make check successful Created 100 framework instances on a 24 CPU machine. Master failover detected by the framework process and continue to work as expected. Thanks, Mandeep Chadha
Re: Review Request 39060: Create master detector per url & not per framework
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39060/ --- (Updated Jan. 5, 2016, 3:17 a.m.) Review request for mesos and Joris Van Remoortere. Bugs: MESOS-3595 https://issues.apache.org/jira/browse/MESOS-3595 Repository: mesos Description --- If the number of framework created exceeds the lib process threads then during master failover the zookeeper updates can cause deadlock. Diffs (updated) - include/mesos/scheduler.hpp 049c041286f3167e79cc5ea8a9e0bf8d42569832 src/sched/sched.cpp 44eb4f50e8ed84297268d94a3a0320c843ff6d8c src/tests/mesos.hpp a4811b0d8dee33ff2ca4968f532ce64b7ea95249 Diff: https://reviews.apache.org/r/39060/diff/ Testing --- make check successful Created 100 framework instances on a 24 CPU machine. Master failover detected by the framework process and continue to work as expected. Thanks, Mandeep Chadha
Re: Review Request 40730: Fixing MESOS-3552 by using CHECK_NEAR to avoid errors due to double precision errors.
> On Nov. 26, 2015, 3:14 p.m., Neil Conway wrote: > > src/common/resources.cpp, line 951 > > <https://reviews.apache.org/r/40730/diff/1/?file=1147236#file1147236line951> > > > > This definitely needs an explanatory comment. > > Bernd Mathiske wrote: > I'll propose a patch that addresses these issues. What if results or cpus returns isNone or one of them is isNone ( Option ) , then get() is going to crash. CHECK_NEAR() does the same thing as the previous review request and this is the history of previous review request : "CHECK_DOUBLE_EQ will also fail. F0930 18:15:38.169140 26984 resources.cpp:874] Check failed: (result.cpus().get()) >= (cpus().get())-0.001L (24 vs. 24) Check failure stack trace: F0930 18:15:38.169322 26991 resources.cpp:874] Check failed: (result.cpus().get()) >= (cpus().get())-0.001L (24 vs. 24) Check failure stack trace: CHECK_NEAR() is the right Macro to use. But we also need to check (isNone() and isNone()) equality and hence the present implementation. I think the long term fix is to wrap double into Double and have the opeartor== to the right thing. " Using CHECK_NEAR() is fine but not checking other state of Option<> may lead to crash and is not complete. - Mandeep --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/40730/#review108140 --- On Nov. 26, 2015, 6:52 a.m., Avinash sridharan wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/40730/ > --- > > (Updated Nov. 26, 2015, 6:52 a.m.) > > > Review request for mesos, Bernd Mathiske, Klaus Ma, and Neil Conway. > > > Bugs: MESOS-3552 > https://issues.apache.org/jira/browse/MESOS-3552 > > > Repository: mesos > > > Description > --- > > A version of this fix was already proposed by Mandeep Chadha (@mchadha) and > submitted to the reivew board https://reviews.apache.org/r/39056/ . So the > main credit should go to him for this fix. > > > Diffs > - > > src/common/resources.cpp b4abf5405039d7d0a5028ccf034ad2e9623d064c > > Diff: https://reviews.apache.org/r/40730/diff/ > > > Testing > --- > > Added Mandeep's test case to GTEST https://reviews.apache.org/r/39056/ and > enabled an existing precision test (https://reviews.apache.org/r/40732/) to > test this fix. > > > Thanks, > > Avinash sridharan > >
Re: Review Request 39056: Fix for Mesos master crash due to check failure.
> On Oct. 7, 2015, 1:03 a.m., Guangya Liu wrote: > > src/common/resources.cpp, line 879 > > <https://reviews.apache.org/r/39056/diff/3/?file=1092072#file1092072line879> > > > > The meos is now using 0.01 as the MIN_CPUS > > Guangya Liu wrote: > As the mesos is using 0.01 as the MIN_CPUS, I think it is OK using 0.01 > as the epsilon Are we ok with changes ? Friendly ping :) - Mandeep --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39056/#review101729 ------- On Oct. 6, 2015, 10:07 p.m., Mandeep Chadha wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/39056/ > --- > > (Updated Oct. 6, 2015, 10:07 p.m.) > > > Review request for mesos and Neil Conway. > > > Bugs: MESOS-3552 > https://issues.apache.org/jira/browse/MESOS-3552 > > > Repository: mesos > > > Description > --- > > Check failed due to double comparison : MESOS-3552. > > > Diffs > - > > src/common/resources.cpp 601388c35a1bff37c58e753d1870d53b8d0af2d1 > src/tests/reservation_tests.cpp 6b7c43c8b5c64618249dbee926383242320c111e > src/v1/resources.cpp dc868903472f8f3a1ddc56092e3f8f81d953ce39 > > Diff: https://reviews.apache.org/r/39056/diff/ > > > Testing > --- > > Added unit test. > make check successful. > > > Thanks, > > Mandeep Chadha > >
Re: Review Request 39060: Create master detector per url & not per framework
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39060/ --- (Updated Nov. 3, 2015, 9:32 a.m.) Review request for mesos and Joris Van Remoortere. Bugs: MESOS-3595 https://issues.apache.org/jira/browse/MESOS-3595 Repository: mesos Description --- If the number of framework created exceeds the lib process threads then during master failover the zookeeper updates can cause deadlock. Diffs (updated) - include/mesos/scheduler.hpp f571d42d1508632152473c4f4ade60ae3900fce1 src/sched/sched.cpp 9c5e3b8e42605f4647d897ba02ea3a17e494f355 Diff: https://reviews.apache.org/r/39060/diff/ Testing --- make check successful Created 100 framework instances on a 24 CPU machine. Master failover detected by the framework process and continue to work as expected. Thanks, Mandeep Chadha
Re: Review Request 39060: Create master detector per url & not per framework
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39060/ --- (Updated Nov. 3, 2015, 6:06 p.m.) Review request for mesos and Joris Van Remoortere. Bugs: MESOS-3595 https://issues.apache.org/jira/browse/MESOS-3595 Repository: mesos Description --- If the number of framework created exceeds the lib process threads then during master failover the zookeeper updates can cause deadlock. Diffs (updated) - include/mesos/scheduler.hpp f571d42d1508632152473c4f4ade60ae3900fce1 src/sched/sched.cpp 9c5e3b8e42605f4647d897ba02ea3a17e494f355 Diff: https://reviews.apache.org/r/39060/diff/ Testing --- make check successful Created 100 framework instances on a 24 CPU machine. Master failover detected by the framework process and continue to work as expected. Thanks, Mandeep Chadha
Re: Review Request 39060: Create master detector per url & not per framework
> On Oct. 21, 2015, 12:41 a.m., Joris Van Remoortere wrote: > > Hey Mandeep, Thanks for taking on this important work! > > Some high level questions: > > - Do we need to do reference counting manually or can we use a construct > > like `Shared` in Libprocess that wraps the `std::shared_ptr`. > > - Do we want to surface the singleton pattern in this way? What are the > > impacts of this on the user (i.e. your logic calling the `acquire` / > > `release` code currently? Does the user care about holding a reference to > > the singleton? > > > > Regardless, I added some style comments as they are valuable for you as a > > contributor. Thanks Joris for the comments. Let me get that done and upload the new diff. - Mandeep --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39060/#review103331 ------- On Oct. 8, 2015, 5:05 p.m., Mandeep Chadha wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/39060/ > --- > > (Updated Oct. 8, 2015, 5:05 p.m.) > > > Review request for mesos and Joris Van Remoortere. > > > Bugs: MESOS-3595 > https://issues.apache.org/jira/browse/MESOS-3595 > > > Repository: mesos > > > Description > --- > > If the number of framework created exceeds the lib process > threads then during master failover the zookeeper updates can > cause deadlock. > > > Diffs > - > > src/sched/sched.cpp 571e00d303009a940f17c8ed4582749a718e846d > > Diff: https://reviews.apache.org/r/39060/diff/ > > > Testing > --- > > make check successful > Created 100 framework instances on a 24 CPU machine. Master failover detected > by the framework process and continue to work as expected. > > > Thanks, > > Mandeep Chadha > >
Re: Review Request 39060: Create master detector per url & not per framework
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39060/ --- (Updated Oct. 8, 2015, 5:05 p.m.) Review request for mesos and Joris Van Remoortere. Bugs: MESOS-3595 https://issues.apache.org/jira/browse/MESOS-3595 Repository: mesos Description --- If the number of framework created exceeds the lib process threads then during master failover the zookeeper updates can cause deadlock. Diffs (updated) - src/sched/sched.cpp 571e00d303009a940f17c8ed4582749a718e846d Diff: https://reviews.apache.org/r/39060/diff/ Testing --- make check successful Created 100 framework instances on a 24 CPU machine. Master failover detected by the framework process and continue to work as expected. Thanks, Mandeep Chadha
Re: Review Request 39060: Create master detector per url & not per framework
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39060/ --- (Updated Oct. 7, 2015, 9:42 p.m.) Review request for mesos and Joris Van Remoortere. Bugs: MESOS-3595 https://issues.apache.org/jira/browse/MESOS-3595 Repository: mesos Description --- If the number of framework created exceeds the lib process threads then during master failover the zookeeper updates can cause deadlock. Diffs (updated) - src/sched/sched.cpp 571e00d303009a940f17c8ed4582749a718e846d src/slave/containerizer/provisioner/docker/registry_client.hpp fdb68b675582f603ffb3e96f31c1c9405e238567 src/slave/containerizer/provisioner/docker/registry_client.cpp 4931ae8869a697b1e9d8d4cbc0a871e7cd506285 Diff: https://reviews.apache.org/r/39060/diff/ Testing --- make check successful Created 100 framework instances on a 24 CPU machine. Master failover detected by the framework process and continue to work as expected. Thanks, Mandeep Chadha
Re: Review Request 39060: Create master detector per url & not per framework
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39060/ --- (Updated Oct. 7, 2015, 9:49 p.m.) Review request for mesos and Joris Van Remoortere. Bugs: MESOS-3595 https://issues.apache.org/jira/browse/MESOS-3595 Repository: mesos Description --- If the number of framework created exceeds the lib process threads then during master failover the zookeeper updates can cause deadlock. Diffs (updated) - src/sched/sched.cpp 571e00d303009a940f17c8ed4582749a718e846d Diff: https://reviews.apache.org/r/39060/diff/ Testing --- make check successful Created 100 framework instances on a 24 CPU machine. Master failover detected by the framework process and continue to work as expected. Thanks, Mandeep Chadha
Review Request 39056: Fix for Mesos master crash due to check failure.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39056/ --- Review request for mesos. Bugs: MESOS-3552 https://issues.apache.org/jira/browse/MESOS-3552 Repository: mesos Description --- Check failed due to double comparison : MESOS-3552. Diffs - src/common/resources.cpp 601388c35a1bff37c58e753d1870d53b8d0af2d1 src/tests/reservation_tests.cpp 6b7c43c8b5c64618249dbee926383242320c111e Diff: https://reviews.apache.org/r/39056/diff/ Testing --- Added unit test. make check successful. Thanks, Mandeep Chadha
Review Request 39060: Create master detector per url & not per framework
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39060/ --- Review request for mesos. Bugs: MESOS-3595 https://issues.apache.org/jira/browse/MESOS-3595 Repository: mesos Description --- If the number of framework created exceeds the lib process threads then during master failover the zookeeper updates can cause deadlock. Diffs - src/sched/sched.cpp 571e00d303009a940f17c8ed4582749a718e846d Diff: https://reviews.apache.org/r/39060/diff/ Testing --- make check successful Created 100 framework instances on a 24 CPU machine. Master failover detected by the framework process and continue to work as expected. Thanks, Mandeep Chadha
Re: Review Request 39056: Fix for Mesos master crash due to check failure.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39056/ --- (Updated Oct. 6, 2015, 9:33 p.m.) Review request for mesos and Neil Conway. Bugs: MESOS-3552 https://issues.apache.org/jira/browse/MESOS-3552 Repository: mesos Description --- Check failed due to double comparison : MESOS-3552. Diffs (updated) - src/common/resources.cpp 601388c35a1bff37c58e753d1870d53b8d0af2d1 src/tests/reservation_tests.cpp 6b7c43c8b5c64618249dbee926383242320c111e src/v1/resources.cpp dc868903472f8f3a1ddc56092e3f8f81d953ce39 Diff: https://reviews.apache.org/r/39056/diff/ Testing --- Added unit test. make check successful. Thanks, Mandeep Chadha
Re: Review Request 39056: Fix for Mesos master crash due to check failure.
> On Oct. 6, 2015, 6:34 p.m., Neil Conway wrote: > > src/common/resources.cpp, line 874 > > <https://reviews.apache.org/r/39056/diff/1/?file=1091951#file1091951line874> > > > > This change should also be applied to Resources::apply() in > > src/v1/resources.cpp Thanks Neil. - Mandeep --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39056/#review101674 ------- On Oct. 6, 2015, 9:33 p.m., Mandeep Chadha wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/39056/ > --- > > (Updated Oct. 6, 2015, 9:33 p.m.) > > > Review request for mesos and Neil Conway. > > > Bugs: MESOS-3552 > https://issues.apache.org/jira/browse/MESOS-3552 > > > Repository: mesos > > > Description > --- > > Check failed due to double comparison : MESOS-3552. > > > Diffs > - > > src/common/resources.cpp 601388c35a1bff37c58e753d1870d53b8d0af2d1 > src/tests/reservation_tests.cpp 6b7c43c8b5c64618249dbee926383242320c111e > src/v1/resources.cpp dc868903472f8f3a1ddc56092e3f8f81d953ce39 > > Diff: https://reviews.apache.org/r/39056/diff/ > > > Testing > --- > > Added unit test. > make check successful. > > > Thanks, > > Mandeep Chadha > >
Re: Review Request 39056: Fix for Mesos master crash due to check failure.
> On Oct. 6, 2015, 9:47 p.m., Jie Yu wrote: > > src/common/resources.cpp, line 880 > > <https://reviews.apache.org/r/39056/diff/2/?file=1092068#file1092068line880> > > > > Can you introduce a `CHECK_DOUBLE_EQ` in `stout/check.hpp`, similar to > > > > https://code.google.com/p/googletest/wiki/AdvancedGuide#Floating-Point_Macros > > > > To choose a default epsilon, you probably want to check the above link. > > Gtest uses "Units in the Last Place (ULPs)" as the default. > > Mandeep Chadha wrote: > CHECK_DOUBLE_EQ will also fail. > > F0930 18:15:38.169140 26984 resources.cpp:874] Check failed: > (result.cpus().get()) >= (cpus().get())-0.001L (24 vs. 24) *** > Check failure stack trace: *** F0930 18:15:38.169322 26991 resources.cpp:874] > Check failed: (result.cpus().get()) >= (cpus().get())-0.001L (24 > vs. 24) *** Check failure stack trace: *** > > CHECK_NEAR() is the right Macro to use. > > > But we also need to check (isNone() and isNone()) equality and hence the > present implementation. I think the long term fix is to wrap double into > Double and have the opeartor== to the right thing. > > Jie Yu wrote: > How do you implemente CHECK_DOUBLE_EQ? You copied from gtest code? you can just call it: CHECK_DOUBLE_EQ(result.cpus.get(), cpus.get()); src/common/resources.cpp ( includes the glog/logging.h and all the Macros are available). #include - Mandeep --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39056/#review101691 --- On Oct. 6, 2015, 10:07 p.m., Mandeep Chadha wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/39056/ > --- > > (Updated Oct. 6, 2015, 10:07 p.m.) > > > Review request for mesos and Neil Conway. > > > Bugs: MESOS-3552 > https://issues.apache.org/jira/browse/MESOS-3552 > > > Repository: mesos > > > Description > --- > > Check failed due to double comparison : MESOS-3552. > > > Diffs > - > > src/common/resources.cpp 601388c35a1bff37c58e753d1870d53b8d0af2d1 > src/tests/reservation_tests.cpp 6b7c43c8b5c64618249dbee926383242320c111e > src/v1/resources.cpp dc868903472f8f3a1ddc56092e3f8f81d953ce39 > > Diff: https://reviews.apache.org/r/39056/diff/ > > > Testing > --- > > Added unit test. > make check successful. > > > Thanks, > > Mandeep Chadha > >
Re: Review Request 39056: Fix for Mesos master crash due to check failure.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39056/ --- (Updated Oct. 6, 2015, 10:07 p.m.) Review request for mesos and Neil Conway. Bugs: MESOS-3552 https://issues.apache.org/jira/browse/MESOS-3552 Repository: mesos Description --- Check failed due to double comparison : MESOS-3552. Diffs (updated) - src/common/resources.cpp 601388c35a1bff37c58e753d1870d53b8d0af2d1 src/tests/reservation_tests.cpp 6b7c43c8b5c64618249dbee926383242320c111e src/v1/resources.cpp dc868903472f8f3a1ddc56092e3f8f81d953ce39 Diff: https://reviews.apache.org/r/39056/diff/ Testing --- Added unit test. make check successful. Thanks, Mandeep Chadha