I used to meet a similar issue with Zookeeper + Messo; I resolved it by remove
127.0.1.1 from /etc/hosts; here is an example:
klaus@klaus-OptiPlex-780:~/Workspace/mesos$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 klaus-OptiPlex-780 <<===== remove this line, and a new line:
mapping IP (e.g. 192.168.1.100) with hostname
...
BTW, please also clear-up the log directory and re-start ZK & Mesos.
If any more comments, please let me know.
Regards,----Klaus Ma (马达), PMP® | http://www.cguru.net
CallSend SMSCall from mobileAdd to SkypeYou'll need Skype CreditFree via
SkypeCallSend SMSCall from mobileAdd to SkypeYou'll need Skype CreditFree via
SkypeDate: Thu, 13 Aug 2015 12:20:34 -0700
Subject: Re: Can't start master properly (stale state issue?); help!
From: [email protected]
To: [email protected]
On Thu, Aug 13, 2015 at 11:53 AM, Paul Bell <[email protected]> wrote:
Marco & hasodent,
This is just a quick note to say thank you for your replies.
No problem, you're welcome. I will answer you much more fully tomorrow, but for
now can only manage a few quick observations & questions:
1. Having some months ago encountered a known problem with the IP@ 127.0.1.1
(I'll provide references tomorrow), I early on configured /etc/hosts, replacing
"myHostName 127.0.1.1" with "myHostName <Real_IP>". That said, I can't rule out
a race condition whereby ZK | mesos-master saw the original unchanged
/etc/hosts before I zapped it.
2. What is a znode and how would I drop it?
so, the znode is the fancy name that ZK gives to the nodes in its tree
(trivially, the "path") - assuming that you give Mesos the following ZK
URL:zk://10.10.0.5:2181/mesos/prod
the 'znode' would be `/mesos/prod` and you could go inspect it (using zkCli.sh)
by doing:> ls /mesos/prod
you should see at least one (with the Master running) file: info_0000001 or
json.info_00000001 (depending on whether you're running 0.23 or 0.24) and you
could then inspect its contents with:> get /mesos/prod/info_0000001
For example, if I run a Mesos 0.23 on my localhost, against ZK on the same:
$ ./bin/mesos-master.sh --zk=zk://localhost:2181/mesos/test --quorum=1
--work_dir=/tmp/m23-2 --port=5053I can connect to ZK via zkCli.sh and:
[zk: localhost:2181(CONNECTED) 4] ls /mesos/test
[info_0000000006, log_replicas]
[zk: localhost:2181(CONNECTED) 6] get /mesos/test/info_0000000006
#20150813-120952-18983104-5053-14072ц '"[email protected]:5053*....
192.168.33.120.23.0
cZxid = 0x314
dataLength = 93
.... // a bunch of other metadata
numChildren = 0
(you can remove it with - you guessed it - `rm -f /mesos/test` at the CLI
prompt - stop Mesos first, or it will be a very unhappy Master :).
in the corresponding logs I see (note the "new leader" here too, even though
this was the one and only):
I0813 12:09:52.126509 105455616 group.cpp:656] Trying to get
'/mesos/test/info_0000000006' in ZooKeeper
W0813 12:09:52.127071 107065344 detector.cpp:444] Leading master
[email protected]:5053 is using a Protobuf binary format when registering
with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0813 12:09:52.127094 107065344 detector.cpp:481] A new leading master
([email protected]:5053) is detected
I0813 12:09:52.127187 103845888 master.cpp:1481] The newly elected leader is
[email protected]:5053 with id 20150813-120952-18983104-5053-14072
I0813 12:09:52.127209 103845888 master.cpp:1494] Elected as the leading master!
At this point, I'm almost sure you're running up against some issue with the
log-replica; but I'm the least competent guy here to help you on that one,
hopefully someone else will be able to add insight here.
I start the services (zk, master, marathon; all on same host) by SSHing into
the host & doing "service XXXX start" commands.
Again, thanks very much; and more tomorrow.
Cordially,
Paul
On Thu, Aug 13, 2015 at 1:08 PM, haosdent <[email protected]> wrote:
Hello, how you start the master? And could you try use "netstat -antp|grep
5050" to find whether there are multi master processes run at a same machine or
not?
On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell <[email protected]> wrote:
Hi All,
I hope someone can shed some light on this because I'm getting desperate!
I try to start components zk, mesos-master, and marathon in that order. They
are started via a program that SSHs to the sole host and does "service xxx
start". Everyone starts happily enough. But the Mesos UI shows me:
This master is not the leader, redirecting in 0 seconds ... go now
The pattern seen in all of the mesos-master.INFO logs (one of which shown
below) is that the mesos-master with the correct IP@ starts. But then a "new
leader" is detected and becomes leading master. This new leader shows UPID
([email protected]:5050
I've tried clearing what ZK and mesos-master state I can find, but this problem
will not "go away".
Would someone be so kind as to a) explain what is happening here and b) suggest
remedies?
Thanks very much.
-Paul
Log file created at: 2015/08/13 10:19:43Running on machine: 71.100.14.9Log line
format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msgI0813 10:19:43.225636
2542 logging.cpp:172] INFO level logging started!I0813 10:19:43.235213 2542
main.cpp:181] Build: 2015-05-05 06:15:50 by rootI0813 10:19:43.235244 2542
main.cpp:183] Version: 0.22.1I0813 10:19:43.235257 2542 main.cpp:186] Git tag:
0.22.1I0813 10:19:43.235268 2542 main.cpp:190] Git SHA:
d6309f92a7f9af3ab61a878403e3d9c284ea87e0I0813 10:19:43.245098 2542
leveldb.cpp:176] Opened db in 9.386828msI0813 10:19:43.247138 2542
leveldb.cpp:183] Compacted db in 1.956669msI0813 10:19:43.247194 2542
leveldb.cpp:198] Created db iterator in 13961nsI0813 10:19:43.247206 2542
leveldb.cpp:204] Seeked to beginning of db in 677nsI0813 10:19:43.247215 2542
leveldb.cpp:273] Iterated through 0 keys in the db in 243nsI0813
10:19:43.247252 2542 replica.cpp:744] Replica recovered with log positions 0
-> 0 with 1 holes and 0 unlearnedI0813 10:19:43.248755 2611 log.cpp:238]
Attempting to join replica to ZooKeeper groupI0813 10:19:43.248924 2542
main.cpp:306] Starting Mesos masterI0813 10:19:43.249244 2612 recover.cpp:449]
Starting replica recoveryI0813 10:19:43.250239 2612 recover.cpp:475] Replica
is in EMPTY statusI0813 10:19:43.250819 2612 replica.cpp:641] Replica in EMPTY
status received a broadcasted recover requestI0813 10:19:43.251014 2607
recover.cpp:195] Received a recover response from a replica in EMPTY
statusI0813 10:19:43.249503 2542 master.cpp:349] Master
20150813-101943-151938119-5050-2542 (71.100.14.9) started on
71.100.14.9:5050I0813 10:19:43.252053 2610 recover.cpp:566] Updating replica
status to STARTINGI0813 10:19:43.252571 2542 master.cpp:397] Master allowing
unauthenticated frameworks to registerI0813 10:19:43.253159 2542
master.cpp:402] Master allowing unauthenticated slaves to registerI0813
10:19:43.254276 2612 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb
took 1.816161msI0813 10:19:43.254323 2612 replica.cpp:323] Persisted replica
status to STARTINGI0813 10:19:43.254905 2612 recover.cpp:475] Replica is in
STARTING statusI0813 10:19:43.255203 2612 replica.cpp:641] Replica in STARTING
status received a broadcasted recover requestI0813 10:19:43.255265 2612
recover.cpp:195] Received a recover response from a replica in STARTING
statusI0813 10:19:43.255343 2612 recover.cpp:566] Updating replica status to
VOTINGI0813 10:19:43.258730 2611 master.cpp:1295] Successfully attached file
'/var/log/mesos/mesos-master.INFO'I0813 10:19:43.258760 2609
contender.cpp:131] Joining the ZK groupI0813 10:19:43.258862 2612
leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 3.477458msI0813
10:19:43.258894 2612 replica.cpp:323] Persisted replica status to VOTINGI0813
10:19:43.258934 2612 recover.cpp:580] Successfully joined the Paxos groupI0813
10:19:43.258987 2612 recover.cpp:464] Recover process terminatedI0813
10:19:46.590340 2606 group.cpp:313] Group process (group(1)@71.100.14.9:5050)
connected to ZooKeeperI0813 10:19:46.590373 2606 group.cpp:790] Syncing group
operations: queue size (joins, cancels, datas) = (0, 0, 0)I0813 10:19:46.590386
2606 group.cpp:385] Trying to create path '/mesos/log_replicas' in
ZooKeeperI0813 10:19:46.591442 2606 network.hpp:424] ZooKeeper group
memberships changedI0813 10:19:46.591514 2606 group.cpp:659] Trying to get
'/mesos/log_replicas/0000000000' in ZooKeeperI0813 10:19:46.592146 2606
group.cpp:659] Trying to get '/mesos/log_replicas/0000000001' in ZooKeeperI0813
10:19:46.593128 2608 network.hpp:466] ZooKeeper group PIDs: {
log-replica(1)@127.0.1.1:5050 }I0813 10:19:46.593955 2608 group.cpp:313] Group
process (group(2)@71.100.14.9:5050) connected to ZooKeeperI0813 10:19:46.593977
2608 group.cpp:790] Syncing group operations: queue size (joins, cancels,
datas) = (1, 0, 0)I0813 10:19:46.593986 2608 group.cpp:385] Trying to create
path '/mesos/log_replicas' in ZooKeeperI0813 10:19:46.594894 2605
group.cpp:313] Group process (group(3)@71.100.14.9:5050) connected to
ZooKeeperI0813 10:19:46.594992 2605 group.cpp:790] Syncing group operations:
queue size (joins, cancels, datas) = (1, 0, 0)I0813 10:19:46.595007 2605
group.cpp:385] Trying to create path '/mesos' in ZooKeeperI0813 10:19:46.595654
2607 group.cpp:313] Group process (group(4)@71.100.14.9:5050) connected to
ZooKeeperI0813 10:19:46.595741 2607 group.cpp:790] Syncing group operations:
queue size (joins, cancels, datas) = (0, 0, 0)I0813 10:19:46.595785 2607
group.cpp:385] Trying to create path '/mesos' in ZooKeeperI0813 10:19:46.598635
2612 network.hpp:424] ZooKeeper group memberships changedI0813 10:19:46.598775
2612 group.cpp:659] Trying to get '/mesos/log_replicas/0000000000' in
ZooKeeperI0813 10:19:46.599954 2612 group.cpp:659] Trying to get
'/mesos/log_replicas/0000000001' in ZooKeeperI0813 10:19:46.600307 2611
contender.cpp:247] New candidate (id='3') has entered the contest for
leadershipI0813 10:19:46.600721 2612 group.cpp:659] Trying to get
'/mesos/log_replicas/0000000002' in ZooKeeperI0813 10:19:46.601297 2612
network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@127.0.1.1:5050,
log-replica(1)@71.100.14.9:5050 }I0813 10:19:46.601752 2607 detector.cpp:138]
Detected a new leader: (id='0')I0813 10:19:46.601850 2611 group.cpp:659]
Trying to get '/mesos/info_0000000000' in ZooKeeperI0813 10:19:46.602330 2611
detector.cpp:452] A new leading master ([email protected]:5050) is
detectedI0813 10:19:46.602412 2607 master.cpp:1356] The newly elected leader
is [email protected]:5050 with id 20150813-101601-16842879-5050-6368I0813
10:19:58.542353 2611 http.cpp:516] HTTP request for '/master/state.json'I0813
10:19:59.457691 2612 http.cpp:516] HTTP request for '/master/state.json'I0813
10:20:00.355845 2606 http.cpp:516] HTTP request for '/master/state.json'I0813
10:20:06.577448 2609 http.cpp:352] HTTP request for '/master/redirect'
--
Best Regards,
Haosdent Huang