My 2 cents - is there a possibility of stale data in /var/lib/mesos? Can you try deleting the /var/lib/mesos directory on all 3 systems and then bringing the masters up again?
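For what it's worth, here is a sketch of that cleanup (the hostnames and the `mesos-master` service name are assumptions based on this thread - adjust to your environment; it only prints the commands so you can review them before running anything):

```shell
# Dry run: print the per-host cleanup commands instead of executing them.
# Remove the outer `echo` (or pipe each line to sh) once you've reviewed them.
masters="192.168.122.132 192.168.122.225 192.168.122.171"
for host in $masters; do
  echo "ssh $host 'sudo systemctl stop mesos-master && sudo rm -rf /var/lib/mesos'"
done
```

Remember to stop each master before wiping its work_dir, and restart all three afterwards so the replicated log can re-initialize from empty state.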
On Sat, Jun 4, 2016 at 9:04 PM, Qian Zhang <zhq527...@gmail.com> wrote:
> I am using the latest Mesos code in git (master branch). However, I also
> tried the official 0.28.1 release, but no luck there either.
>
> Thanks,
> Qian Zhang
>
> On Sun, Jun 5, 2016 at 8:04 AM, Jie Yu <yujie....@gmail.com> wrote:
>> Which version are you using?
>>
>> - Jie
>>
>> On Sat, Jun 4, 2016 at 4:34 PM, Qian Zhang <zhq527...@gmail.com> wrote:
>>> Thanks Vinod and Dick.
>>>
>>> I think my 3 ZK servers have formed a quorum; each of them has the
>>> following config:
>>>
>>> $ cat conf/zoo.cfg
>>> server.1=192.168.122.132:2888:3888
>>> server.2=192.168.122.225:2888:3888
>>> server.3=192.168.122.171:2888:3888
>>> autopurge.purgeInterval=6
>>> autopurge.snapRetainCount=5
>>> initLimit=10
>>> syncLimit=5
>>> maxClientCnxns=0
>>> clientPort=2181
>>> tickTime=2000
>>> quorumListenOnAllIPs=true
>>> dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
>>> dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
>>>
>>> And when I run "bin/zkServer.sh status" on each of them, I can see
>>> "Mode: leader" for one, and "Mode: follower" for the other two.
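A quick way to double-check that quorum claim, besides zkServer.sh, is ZooKeeper's four-letter-word "stat" command. The helper below is only a sketch (it assumes `nc` is installed and uses the IPs from the config above); it succeeds only when exactly one server reports leader and two report follower:

```shell
# check_modes: read "Mode: ..." lines (one per ZK server) on stdin and
# succeed only if exactly one leader and two followers are present.
check_modes() {
  modes=$(cat)
  leaders=$(printf '%s\n' "$modes" | grep -c 'Mode: leader')
  followers=$(printf '%s\n' "$modes" | grep -c 'Mode: follower')
  [ "$leaders" -eq 1 ] && [ "$followers" -eq 2 ]
}

# Against the live ensemble (assumes `nc` is available):
#   for h in 192.168.122.132 192.168.122.225 192.168.122.171; do
#     echo stat | nc -w 2 "$h" 2181 | grep '^Mode:'
#   done | check_modes && echo "quorum looks healthy"
```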
>>> I have already tried to manually start 3 masters simultaneously, and here
>>> is what I see in their logs:
>>>
>>> In 192.168.122.171 (this is the first master I started):
>>> I0605 07:12:49.418721 1187 detector.cpp:152] Detected a new leader: (id='25')
>>> I0605 07:12:49.419276 1186 group.cpp:698] Trying to get '/mesos/log_replicas/0000000024' in ZooKeeper
>>> I0605 07:12:49.420013 1188 group.cpp:698] Trying to get '/mesos/json.info_0000000025' in ZooKeeper
>>> I0605 07:12:49.423807 1188 zookeeper.cpp:259] A new leading master (UPID=master@192.168.122.171:5050) is detected
>>> I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@192.168.122.171:5050 }
>>> I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected leader is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>>> I0605 07:12:49.424895 1187 master.cpp:1964] Elected as the leading master!
>>>
>>> In 192.168.122.225 (the second master I started):
>>> I0605 07:12:51.918702 2246 detector.cpp:152] Detected a new leader: (id='25')
>>> I0605 07:12:51.919983 2246 group.cpp:698] Trying to get '/mesos/json.info_0000000025' in ZooKeeper
>>> I0605 07:12:51.921910 2249 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@192.168.122.171:5050 }
>>> I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (6)@192.168.122.225:5050
>>> I0605 07:12:51.927891 2246 zookeeper.cpp:259] A new leading master (UPID=master@192.168.122.171:5050) is detected
>>> I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected leader is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>>>
>>> In 192.168.122.132 (the last master I started):
>>> I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader: (id='25')
>>> I0605 07:12:53.555179 16429 group.cpp:698] Trying to get '/mesos/json.info_0000000025' in ZooKeeper
>>> I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master (UPID=master@192.168.122.171:5050) is detected
>>>
>>> So right after I started these 3 masters, the first one (192.168.122.171)
>>> was successfully elected as leader, but after 60s it failed with the error
>>> mentioned in my first mail. Then 192.168.122.225 was elected as leader, but
>>> it failed with the same error after another 60s, and the same thing happened
>>> to the last one (192.168.122.132). So after about 180s, all 3 of my masters
>>> were down.
>>>
>>> I tried both:
>>> sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2 --work_dir=/var/lib/mesos/master
>>> and
>>> sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2 --work_dir=/var/lib/mesos/master
>>> and I see the same error for both.
>>>
>>> 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
>>> running on a KVM hypervisor host.
>>>
>>> Thanks,
>>> Qian Zhang
>>>
>>> On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <d...@hellooperator.net> wrote:
>>>> You told the master it needed a quorum of 2 and it's the only one
>>>> online, so it's bombing out. That's the expected behaviour.
>>>>
>>>> You need to start at least 2 zookeepers before they will form a
>>>> functional group, and the same goes for the masters.
>>>>
>>>> You haven't mentioned how you set up your zookeeper cluster, so I'm
>>>> assuming that's working correctly (3 nodes, all aware of the other 2
>>>> in their config). If not, you need to sort that out first.
>>>> Also, I think your zk URL is wrong - you want to list all 3 zookeeper
>>>> nodes, like this:
>>>>
>>>> sudo ./bin/mesos-master.sh --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2 --work_dir=/var/lib/mesos/master
>>>>
>>>> Once you've run that command on 2 hosts things should start working;
>>>> you'll want all 3 up for redundancy.
>>>>
>>>> On 4 June 2016 at 16:42, Qian Zhang <zhq527...@gmail.com> wrote:
>>>>> Hi Folks,
>>>>>
>>>>> I am trying to set up a Mesos HA env with 3 nodes; each node has
>>>>> Zookeeper running, so they form a Zookeeper cluster. Then, when I
>>>>> started the first Mesos master on one node with:
>>>>> sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2 --work_dir=/var/lib/mesos/master
>>>>>
>>>>> I found it hangs here for 60 seconds:
>>>>> I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master (UPID=master@192.168.122.132:5050) is detected
>>>>> I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader is master@192.168.122.132:5050 with id 40d387a6-4d61-49d6-af44-51dd41457390
>>>>> I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading master!
>>>>> I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
>>>>> I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
>>>>> I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the writer
>>>>>
>>>>> And after 60s, the master fails:
>>>>> F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
>>>>> *** Check failure stack trace: ***
>>>>> @ 0x7f4b81372f4e google::LogMessage::Fail()
>>>>> @ 0x7f4b81372e9a google::LogMessage::SendToLog()
>>>>> @ 0x7f4b8137289c google::LogMessage::Flush()
>>>>> @ 0x7f4b813757b0 google::LogMessageFatal::~LogMessageFatal()
>>>>> @ 0x7f4b8040eea0 mesos::internal::master::fail()
>>>>> @ 0x7f4b804dbeb3 _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>>>>> @ 0x7f4b804ba453 _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>>>>> @ 0x7f4b804898d7 _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>>>>> @ 0x7f4b804dbf80 _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>>>>> @ 0x49d257 std::function<>::operator()()
>>>>> @ 0x49837f _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>>>>> @ 0x493024 process::Future<>::fail()
>>>>> @ 0x7f4b8015ad20 process::Promise<>::fail()
>>>>> @ 0x7f4b804d9295 process::internal::thenf<>()
>>>>> @ 0x7f4b8051788f _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>>>>> @ 0x7f4b8050fa3b std::_Bind<>::operator()<>()
>>>>> @ 0x7f4b804f94e3 std::_Function_handler<>::_M_invoke()
>>>>> @ 0x7f4b8050fc69 std::function<>::operator()()
>>>>> @ 0x7f4b804f9609 _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>>>>> @ 0x7f4b80517936 _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>>>>> @ 0x7f4b8050fc69 std::function<>::operator()()
>>>>> @ 0x7f4b8056b1b4 process::internal::run<>()
>>>>> @ 0x7f4b80561672 process::Future<>::fail()
>>>>> @ 0x7f4b8059bf5f std::_Mem_fn<>::operator()<>()
>>>>> @ 0x7f4b8059757f _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>>>>> @ 0x7f4b8058fad1 _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>>>>> @ 0x7f4b80585a41 _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>>>>> @ 0x7f4b80597605 _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>>>>> @ 0x49d257 std::function<>::operator()()
>>>>> @ 0x49837f _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>>>>> @ 0x7f4b8056164a process::Future<>::fail()
>>>>> @ 0x7f4b8055a378 process::Promise<>::fail()
>>>>>
>>>>> I tried both Zookeeper 3.4.8 and 3.4.6 with the latest code of Mesos,
>>>>> but no luck with either. Any ideas about what happened? Thanks.
>>>>>
>>>>> Thanks,
>>>>> Qian Zhang

--
ever tried. ever failed. no matter. try again. fail again. fail better.
-- Samuel Beckett
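One more thing worth checking for this failure mode: "Failed to perform fetch within 1mins" right after winning the election usually means the registrar's replicated log cannot reach a quorum of masters, so besides wiping stale state it's worth verifying the masters can reach each other on port 5050. A minimal reachability probe (a bash-only sketch; the IPs are the ones from this thread):

```shell
# check_port HOST PORT: succeed iff a TCP connection can be opened.
# Uses bash's /dev/tcp redirection, so no extra tools are needed;
# `timeout` keeps each probe bounded at 2 seconds.
check_port() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Run this from each master: every peer must be reachable on 5050
# (replicated log / master port) and 2181 (ZooKeeper client port).
for peer in 192.168.122.132 192.168.122.225 192.168.122.171; do
  for port in 5050 2181; do
    check_port "$peer" "$port" && echo "$peer:$port ok" || echo "$peer:$port blocked?"
  done
done
```

If connectivity and on-disk state both check out and recovery is simply slow, the master's `--registry_fetch_timeout` flag (its default of 1mins matches the log above) can be raised; but in a 3-VM setup like this, the timeout more often points at firewall rules between the VMs or stale data under /var/lib/mesos than at a genuinely slow fetch.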