Re: Can't start master properly (stale state issue?); help!
The only way I got rid of the "no leading master" issue that started this thread was by wiping out the master VM and starting over with a clean VM. That is, stopping/destroying/creating (even rebooting) the cluster had no effect. I suspect that, sooner or later, I will hit this problem again (probably sooner!), and I want to understand how best to handle it. Such an occurrence could be pretty awkward at a customer site.

Thanks for all your help.

Cordially,
Paul

On Thu, Aug 13, 2015 at 9:41 PM, Klaus Ma kl...@cguru.net wrote:

I used to meet a similar issue with ZooKeeper + Mesos; I resolved it by removing 127.0.1.1 from /etc/hosts. Here is an example:

klaus@klaus-OptiPlex-780:~/Workspace/mesos$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 klaus-OptiPlex-780   <= remove this line, and add a new line mapping the real IP (e.g. 192.168.1.100) to the hostname

BTW, please also clean up the log directory and restart ZK and Mesos. If any more comments, please let me know.

Regards,
Klaus Ma (马达), PMP® | http://www.cguru.net

--
Date: Thu, 13 Aug 2015 12:20:34 -0700
Subject: Re: Can't start master properly (stale state issue?); help!
From: ma...@mesosphere.io
To: user@mesos.apache.org

On Thu, Aug 13, 2015 at 11:53 AM, Paul Bell arach...@gmail.com wrote:

Marco and haosdent,

This is just a quick note to say thank you for your replies.

No problem, you're welcome.

I will answer you much more fully tomorrow, but for now can only manage a few quick observations and questions:

1. Having some months ago encountered a known problem with the IP address 127.0.1.1 (I'll provide references tomorrow), I early on configured /etc/hosts, replacing "myHostName 127.0.1.1" with "myHostName Real_IP". That said, I can't rule out a race condition whereby ZK or mesos-master saw the original unchanged /etc/hosts before I zapped it.

2. What is a znode and how would I drop it?
So, "znode" is the fancy name that ZK gives to the nodes in its tree (trivially, the path). Assuming that you give Mesos the following ZK URL:

zk://10.10.0.5:2181/mesos/prod

the znode would be /mesos/prod, and you could go inspect it (using zkCli.sh) by doing:

ls /mesos/prod

You should see at least one file (with the Master running): info_001 or json.info_0001 (depending on whether you're running 0.23 or 0.24), and you could then inspect its contents with:

get /mesos/prod/info_001

For example, if I run a Mesos 0.23 master on my localhost, against ZK on the same:

$ ./bin/mesos-master.sh --zk=zk://localhost:2181/mesos/test --quorum=1 --work_dir=/tmp/m23-2 --port=5053

I can connect to ZK via zkCli.sh and:

[zk: localhost:2181(CONNECTED) 4] ls /mesos/test
[info_06, log_replicas]
[zk: localhost:2181(CONNECTED) 6] get /mesos/test/info_06
#20150813-120952-18983104-5053-14072 'master@192.168.33.1:5053 * 192.168.33.1 0.23.0
cZxid = 0x314
dataLength = 93
// a bunch of other metadata
numChildren = 0

(You can remove it with, you guessed it, `rm -f /mesos/test` at the CLI prompt; stop Mesos first, or it will be a very unhappy Master :).

In the corresponding logs I see (note the "new leader" here too, even though this was the one and only):

I0813 12:09:52.126509 105455616 group.cpp:656] Trying to get '/mesos/test/info_06' in ZooKeeper
W0813 12:09:52.127071 107065344 detector.cpp:444] Leading master master@192.168.33.1:5053 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0813 12:09:52.127094 107065344 detector.cpp:481] A new leading master (UPID=master@192.168.33.1:5053) is detected
I0813 12:09:52.127187 103845888 master.cpp:1481] The newly elected leader is master@192.168.33.1:5053 with id 20150813-120952-18983104-5053-14072
I0813 12:09:52.127209 103845888 master.cpp:1494] Elected as the leading master!
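For reference, the URL-to-znode mapping Marco describes can be sketched in shell (the URL below is his example):

```shell
# The znode is just the path component of the Mesos --zk URL.
zk_url="zk://10.10.0.5:2181/mesos/prod"
znode="/${zk_url#zk://*/}"   # strip the scheme and host:port list
echo "$znode"                # → /mesos/prod
```

The same stripping works for a multi-host URL such as zk://h1:2181,h2:2181/mesos, since the pattern removes everything up to the first slash after the host list.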
At this point, I'm almost sure you're running up against some issue with the log-replica; but I'm the least competent guy here to help you on that one; hopefully someone else will be able to add insight here.

I start the services (zk, master, marathon; all on same host) by SSHing into the host and doing service start commands.

Again, thanks very much; and more tomorrow.

Cordially,
Paul

On Thu, Aug 13, 2015 at 1:08 PM, haosdent haosd...@gmail.com wrote:

Hello, how do you start the master? And could you try `netstat -antp | grep 5050` to find out whether there are multiple master processes running on the same machine?

On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell arach...@gmail.com wrote:

Hi All,

I hope someone can shed some light on this because I'm getting desperate! I try to start components zk, mesos-master, and marathon in that order.
Re: Can't start master properly (stale state issue?); help!
All,

By way of some background: I'm not running a data center (or centers). Rather, I work on a distributed application whose trajectory is taking it into a realm of many Docker containers distributed across many hosts (mostly virtual hosts at the outset). An environment that supports isolation, multi-tenancy, scalability, and some fault tolerance is desirable for this application. Also, the mere ability to simplify, at least somewhat, the management of multiple hosts is of great importance. So, that's more or less how I got to Mesos and to here...

I ended up writing a Java program that configures a collection of host VMs as a Mesos cluster and then, via Marathon, distributes the application containers across the cluster. Configuring and building the cluster is largely a lot of SSH work. Doing the same for the application is part Marathon, part Docker remote API. The containers that need to talk to each other via TCP are connected with Weave's (http://weave.works) overlay network. So the main infrastructure consists of Mesos, Docker, and Weave. The whole thing is pretty amazing, for which I take very little credit. Rather, these are some wonderful technologies, and the folks who write and support them are very helpful. That said, I sometimes feel like I'm juggling chain saws!

In re the issues raised on this thread: all Mesos components were installed via the Mesosphere packages. The 4 VMs in the cluster are all running Ubuntu 14.04 LTS.
My suspicions about the IP address 127.0.1.1 were raised a few months ago when, after seeing this IP in a mesos-master log while things weren't working, I discovered these articles:

https://groups.google.com/forum/#!topic/marathon-framework/1qboeZTOLU4
http://frankhinek.com/build-mesos-multi-node-ha-cluster/ (see note 2)

So, to the point raised just now by Klaus (and earlier in the thread), the aforementioned configuration program does change /etc/hosts (and /etc/hostname) in the way Klaus suggested. But, as I mentioned to Marco and haosdent, I might have encountered a race condition wherein ZK or mesos-master saw the unchanged /etc/hosts before I altered it. I believe that I fixed that issue yesterday.

Also, as part of the cluster create step, I get a bit aggressive (perhaps unwisely) with what I believe are some state repositories. Specifically, I

rm /var/lib/zookeeper/version-2/*
rm -Rf /var/lib/mesos/replicated_log

Should I NOT be doing this? I know from experience that zapping the version-2 directory (ZK's dataDir, IIRC) can solve occasional weirdness. Marco, is /var/lib/mesos/replicated_log what you are referring to when you say "some issue with the log-replica"?

Just a day or two ago I first heard the term "znode" and learned a little about zkCli.sh. I will experiment with it more in the coming days.

As matters now stand, I have the cluster up and running. But before I again deploy the application, I am trying to put the cluster through its paces by periodically cycling it through the states my program can bring about, e.g.,

--cluster create (takes a clean VM and configures it to act as one or more Mesos components: ZK, master, slave)
--cluster stop (stops the Mesos services on each node)
--cluster destroy (configures the VM back to its original clean state)
--cluster create
--cluster stop
--cluster start

et cetera.
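For what it's worth, that clean-up step can be collected into a small sketch; the paths are assumed to be the Mesosphere-package defaults (override the variables if your layout differs), and ZK and mesos-master should be stopped before it runs:

```shell
# Sketch of the state-clearing step described above.
# Assumed default locations; override via environment if yours differ.
ZK_DATA="${ZK_DATA:-/var/lib/zookeeper/version-2}"  # ZK dataDir contents
MESOS_WORK="${MESOS_WORK:-/var/lib/mesos}"          # master work_dir

# Stop ZK and mesos-master before running this, or the master
# will come back up very confused.
rm -rf "${ZK_DATA:?}"/*                  # ZK snapshots and txn logs
rm -rf "${MESOS_WORK:?}/replicated_log"  # master's replicated log
```

The `:?` expansions guard against an empty variable turning either line into `rm -rf /*`.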
The only way I got rid of the "no leading master" issue that started this thread was by wiping out the master VM and starting over with a clean VM. That is, stopping/destroying/creating (even rebooting) the cluster had no effect. I suspect that, sooner or later, I will hit this problem again (probably sooner!), and I want to understand how best to handle it. Such an occurrence could be pretty awkward at a customer site.

Thanks for all your help.

Cordially,
Paul

On Thu, Aug 13, 2015 at 9:41 PM, Klaus Ma kl...@cguru.net wrote:

I used to meet a similar issue with ZooKeeper + Mesos; I resolved it by removing 127.0.1.1 from /etc/hosts. Here is an example:

klaus@klaus-OptiPlex-780:~/Workspace/mesos$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 klaus-OptiPlex-780   <= remove this line, and add a new line mapping the real IP (e.g. 192.168.1.100) to the hostname

BTW, please also clean up the log directory and restart ZK and Mesos. If any more comments, please let me know.

Regards,
Klaus Ma (马达), PMP® | http://www.cguru.net

--
Date: Thu, 13 Aug 2015 12:20:34 -0700
Subject: Re: Can't start master properly (stale state issue?); help!
From: ma...@mesosphere.io
To: user@mesos.apache.org

On Thu, Aug 13, 2015 at 11:53 AM, Paul Bell arach...@gmail.com
Re: Can't start master properly (stale state issue?); help!
Hello, how do you start the master? And could you try `netstat -antp | grep 5050` to find out whether there are multiple master processes running on the same machine?

On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell arach...@gmail.com wrote:

Hi All,

I hope someone can shed some light on this because I'm getting desperate!

I try to start components zk, mesos-master, and marathon in that order. They are started via a program that SSHs to the sole host and does "service xxx start". Everyone starts happily enough. But the Mesos UI shows me:

This master is not the leader, redirecting in 0 seconds ... go now

The pattern seen in all of the mesos-master.INFO logs (one of which is shown below) is that the mesos-master with the correct IP address starts. But then a "new leader" is detected and becomes leading master. This new leader shows UPID master@127.0.1.1:5050.

I've tried clearing what ZK and mesos-master state I can find, but this problem will not go away. Would someone be so kind as to a) explain what is happening here and b) suggest remedies?

Thanks very much.

-Paul

Log file created at: 2015/08/13 10:19:43
Running on machine: 71.100.14.9
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0813 10:19:43.225636  2542 logging.cpp:172] INFO level logging started!
I0813 10:19:43.235213  2542 main.cpp:181] Build: 2015-05-05 06:15:50 by root
I0813 10:19:43.235244  2542 main.cpp:183] Version: 0.22.1
I0813 10:19:43.235257  2542 main.cpp:186] Git tag: 0.22.1
I0813 10:19:43.235268  2542 main.cpp:190] Git SHA: d6309f92a7f9af3ab61a878403e3d9c284ea87e0
I0813 10:19:43.245098  2542 leveldb.cpp:176] Opened db in 9.386828ms
I0813 10:19:43.247138  2542 leveldb.cpp:183] Compacted db in 1.956669ms
I0813 10:19:43.247194  2542 leveldb.cpp:198] Created db iterator in 13961ns
I0813 10:19:43.247206  2542 leveldb.cpp:204] Seeked to beginning of db in 677ns
I0813 10:19:43.247215  2542 leveldb.cpp:273] Iterated through 0 keys in the db in 243ns
I0813 10:19:43.247252  2542 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I0813 10:19:43.248755  2611 log.cpp:238] Attempting to join replica to ZooKeeper group
I0813 10:19:43.248924  2542 main.cpp:306] Starting Mesos master
I0813 10:19:43.249244  2612 recover.cpp:449] Starting replica recovery
I0813 10:19:43.250239  2612 recover.cpp:475] Replica is in EMPTY status
I0813 10:19:43.250819  2612 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0813 10:19:43.251014  2607 recover.cpp:195] Received a recover response from a replica in EMPTY status
I0813 10:19:43.249503  2542 master.cpp:349] Master 20150813-101943-151938119-5050-2542 (71.100.14.9) started on 71.100.14.9:5050
I0813 10:19:43.252053  2610 recover.cpp:566] Updating replica status to STARTING
I0813 10:19:43.252571  2542 master.cpp:397] Master allowing unauthenticated frameworks to register
I0813 10:19:43.253159  2542 master.cpp:402] Master allowing unauthenticated slaves to register
I0813 10:19:43.254276  2612 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 1.816161ms
I0813 10:19:43.254323  2612 replica.cpp:323] Persisted replica status to STARTING
I0813 10:19:43.254905  2612 recover.cpp:475] Replica is in STARTING status
I0813 10:19:43.255203  2612 replica.cpp:641] Replica in STARTING status received a broadcasted recover request
I0813 10:19:43.255265  2612 recover.cpp:195] Received a recover response from a replica in STARTING status
I0813 10:19:43.255343  2612 recover.cpp:566] Updating replica status to VOTING
I0813 10:19:43.258730  2611 master.cpp:1295] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I0813 10:19:43.258760  2609 contender.cpp:131] Joining the ZK group
I0813 10:19:43.258862  2612 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 3.477458ms
I0813 10:19:43.258894  2612 replica.cpp:323] Persisted replica status to VOTING
I0813 10:19:43.258934  2612 recover.cpp:580] Successfully joined the Paxos group
I0813 10:19:43.258987  2612 recover.cpp:464] Recover process terminated
I0813 10:19:46.590340  2606 group.cpp:313] Group process (group(1)@71.100.14.9:5050) connected to ZooKeeper
I0813 10:19:46.590373  2606 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0813 10:19:46.590386  2606 group.cpp:385] Trying to create path '/mesos/log_replicas' in ZooKeeper
I0813 10:19:46.591442  2606 network.hpp:424] ZooKeeper group memberships changed
I0813 10:19:46.591514  2606 group.cpp:659] Trying to get '/mesos/log_replicas/00' in ZooKeeper
I0813 10:19:46.592146  2606 group.cpp:659] Trying to get '/mesos/log_replicas/01' in ZooKeeper
I0813 10:19:46.593128  2608 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@127.0.1.1:5050 }
I0813 10:19:46.593955  2608 group.cpp:313] Group process (group(2)@71.100.14.9:5050)
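haosdent's check can be tightened to list the distinct processes on the master port; a sketch, run here against a made-up sample of netstat output rather than a live box:

```shell
# Distill the suggested netstat check: print each distinct process
# listening on the master port (5050). More than one line would mean
# duplicate masters. The sample below is fabricated for illustration.
sample='tcp 0 0 0.0.0.0:5050 0.0.0.0:* LISTEN 2542/mesos-master
tcp 0 0 127.0.0.1:8080 0.0.0.0:* LISTEN 2700/java'

echo "$sample" | grep ':5050 ' | awk '{print $7}' | sort -u
# → 2542/mesos-master

# On a real host:  netstat -antp | grep ':5050 ' | awk '{print $7}' | sort -u
```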
Re: Can't start master properly (stale state issue?); help!
To be really sure about the possible root cause, I'd need to know how you installed Mesos on your server; if it's via the Mesosphere packages, the configuration is described here: https://open.mesosphere.com/reference/packages/

I am almost[0] sure the behavior you are seeing has something to do with how the server resolves the hostname to an IP for your Master. Unless you give an explicit IP address to bind to (--ip), libprocess will look up the hostname, reverse-DNS it, and resolve to an IP address; if that fails, it falls back to localhost. If you want to try a quick hack, you can run `cat /etc/hostname` on that server, and add a line in /etc/hosts that resolves that name to the actual IP address (71.100.14.9, in your logs).

The other possibility is that it's really 'stale state' in ZK; you can either drop the znode (whichever you used for the --zk path) or launch with a different one.

Finally, if you have the option to run the master without using `service start`, try SSH'ing into the server and doing something like:

/path/to/install/bin/mesos-master.sh --quorum=1 --work_dir=/tmp/mesos --zk=zk://ZK-IP:ZK-PORT/mesos/test --ip=71.100.14.9

and see whether that works. If none of the above helps, please let us know what you see and we'll keep debugging it :)

BTW, the "new leading master" is a bit of a logging decoy; it's not actually new per se, so I'm almost[0] sure the leader never changed.

[0] "almost" as this line confuses me:
I0813 10:19:46.601297 2612 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@127.0.1.1:5050, log-replica(1)@71.100.14.9:5050 }
(but that's because of my lack of deep understanding of how the log-replicas work)

Marco Massenzio
Distributed Systems Engineer

On Thu, Aug 13, 2015 at 7:37 AM, Paul Bell arach...@gmail.com wrote:

Hi All,

I hope someone can shed some light on this because I'm getting desperate! I try to start components zk, mesos-master, and marathon in that order.
They are started via a program that SSHs to the sole host and does "service xxx start". Everyone starts happily enough. But the Mesos UI shows me:

This master is not the leader, redirecting in 0 seconds ... go now

The pattern seen in all of the mesos-master.INFO logs (one of which is shown below) is that the mesos-master with the correct IP address starts. But then a "new leader" is detected and becomes leading master. This new leader shows UPID master@127.0.1.1:5050.

I've tried clearing what ZK and mesos-master state I can find, but this problem will not go away. Would someone be so kind as to a) explain what is happening here and b) suggest remedies?

Thanks very much.

-Paul
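The hostname-resolution path Marco describes can be probed directly; a sketch (the `resolve_ip` helper is mine, not a Mesos tool), using getent so that /etc/hosts entries are consulted via roughly the same NSS lookup path:

```shell
# Hypothetical helper: first IPv4 address a name resolves to via NSS
# (/etc/hosts first, then DNS on a stock Linux config).
resolve_ip() { getent ahostsv4 "$1" | awk '{print $1; exit}'; }

resolve_ip localhost                 # prints 127.0.0.1 on a stock box

# The real check, against this machine's own name:
#   resolve_ip "$(cat /etc/hostname)"
# If that prints 127.0.1.1, the master will advertise the loopback
# address unless started with an explicit --ip.
```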
Can't start master properly (stale state issue?); help!
Hi All,

I hope someone can shed some light on this because I'm getting desperate!

I try to start components zk, mesos-master, and marathon in that order. They are started via a program that SSHs to the sole host and does "service xxx start". Everyone starts happily enough. But the Mesos UI shows me:

This master is not the leader, redirecting in 0 seconds ... go now

The pattern seen in all of the mesos-master.INFO logs (one of which is shown below) is that the mesos-master with the correct IP address starts. But then a "new leader" is detected and becomes leading master. This new leader shows UPID master@127.0.1.1:5050.

I've tried clearing what ZK and mesos-master state I can find, but this problem will not go away. Would someone be so kind as to a) explain what is happening here and b) suggest remedies?

Thanks very much.

-Paul

Log file created at: 2015/08/13 10:19:43
Running on machine: 71.100.14.9
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0813 10:19:43.225636  2542 logging.cpp:172] INFO level logging started!
Re: Can't start master properly (stale state issue?); help!
Marco and haosdent,

This is just a quick note to say thank you for your replies. I will answer you much more fully tomorrow, but for now can only manage a few quick observations and questions:

1. Having some months ago encountered a known problem with the IP address 127.0.1.1 (I'll provide references tomorrow), I early on configured /etc/hosts, replacing "myHostName 127.0.1.1" with "myHostName Real_IP". That said, I can't rule out a race condition whereby ZK or mesos-master saw the original unchanged /etc/hosts before I zapped it.

2. What is a znode and how would I drop it?

I start the services (zk, master, marathon; all on same host) by SSHing into the host and doing service start commands.

Again, thanks very much; and more tomorrow.

Cordially,
Paul

On Thu, Aug 13, 2015 at 1:08 PM, haosdent haosd...@gmail.com wrote:

Hello, how do you start the master? And could you try `netstat -antp | grep 5050` to find out whether there are multiple master processes running on the same machine?

On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell arach...@gmail.com wrote:

Hi All,

I hope someone can shed some light on this because I'm getting desperate! I try to start components zk, mesos-master, and marathon in that order. They are started via a program that SSHs to the sole host and does "service xxx start". Everyone starts happily enough. But the Mesos UI shows me:

This master is not the leader, redirecting in 0 seconds ... go now

The pattern seen in all of the mesos-master.INFO logs (one of which is shown below) is that the mesos-master with the correct IP address starts. But then a "new leader" is detected and becomes leading master. This new leader shows UPID master@127.0.1.1:5050.

I've tried clearing what ZK and mesos-master state I can find, but this problem will not go away. Would someone be so kind as to a) explain what is happening here and b) suggest remedies? Thanks very much.
-Paul
Re: Can't start master properly (stale state issue?); help!
On Thu, Aug 13, 2015 at 11:53 AM, Paul Bell arach...@gmail.com wrote:

Marco and haosdent,

This is just a quick note to say thank you for your replies.

No problem, you're welcome.

I will answer you much more fully tomorrow, but for now can only manage a few quick observations and questions:

1. Having some months ago encountered a known problem with the IP address 127.0.1.1 (I'll provide references tomorrow), I early on configured /etc/hosts, replacing "myHostName 127.0.1.1" with "myHostName Real_IP". That said, I can't rule out a race condition whereby ZK or mesos-master saw the original unchanged /etc/hosts before I zapped it.

2. What is a znode and how would I drop it?

So, "znode" is the fancy name that ZK gives to the nodes in its tree (trivially, the path). Assuming that you give Mesos the following ZK URL:

zk://10.10.0.5:2181/mesos/prod

the znode would be /mesos/prod, and you could go inspect it (using zkCli.sh) by doing:

ls /mesos/prod

You should see at least one file (with the Master running): info_001 or json.info_0001 (depending on whether you're running 0.23 or 0.24), and you could then inspect its contents with:

get /mesos/prod/info_001

For example, if I run a Mesos 0.23 master on my localhost, against ZK on the same:

$ ./bin/mesos-master.sh --zk=zk://localhost:2181/mesos/test --quorum=1 --work_dir=/tmp/m23-2 --port=5053

I can connect to ZK via zkCli.sh and:

[zk: localhost:2181(CONNECTED) 4] ls /mesos/test
[info_06, log_replicas]
[zk: localhost:2181(CONNECTED) 6] get /mesos/test/info_06
#20150813-120952-18983104-5053-14072 'master@192.168.33.1:5053 * 192.168.33.1 0.23.0
cZxid = 0x314
dataLength = 93
// a bunch of other metadata
numChildren = 0

(You can remove it with, you guessed it, `rm -f /mesos/test` at the CLI prompt; stop Mesos first, or it will be a very unhappy Master :).
In the corresponding logs I see (note the "new leader" here too, even though this was the one and only):

I0813 12:09:52.126509 105455616 group.cpp:656] Trying to get '/mesos/test/info_06' in ZooKeeper
W0813 12:09:52.127071 107065344 detector.cpp:444] Leading master master@192.168.33.1:5053 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0813 12:09:52.127094 107065344 detector.cpp:481] A new leading master (UPID=master@192.168.33.1:5053) is detected
I0813 12:09:52.127187 103845888 master.cpp:1481] The newly elected leader is master@192.168.33.1:5053 with id 20150813-120952-18983104-5053-14072
I0813 12:09:52.127209 103845888 master.cpp:1494] Elected as the leading master!

At this point, I'm almost sure you're running up against some issue with the log-replica; but I'm the least competent guy here to help you on that one; hopefully someone else will be able to add insight here.

I start the services (zk, master, marathon; all on same host) by SSHing into the host and doing service start commands.

Again, thanks very much; and more tomorrow.

Cordially,
Paul

On Thu, Aug 13, 2015 at 1:08 PM, haosdent haosd...@gmail.com wrote:

Hello, how do you start the master? And could you try `netstat -antp | grep 5050` to find out whether there are multiple master processes running on the same machine?

On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell arach...@gmail.com wrote:

Hi All,

I hope someone can shed some light on this because I'm getting desperate! I try to start components zk, mesos-master, and marathon in that order. They are started via a program that SSHs to the sole host and does "service xxx start". Everyone starts happily enough. But the Mesos UI shows me:

This master is not the leader, redirecting in 0 seconds ... go now

The pattern seen in all of the mesos-master.INFO logs (one of which is shown below) is that the mesos-master with the correct IP address starts.
But then a "new leader" is detected and becomes leading master. This new leader shows UPID=master@127.0.1.1:5050.

I've tried clearing what ZK and mesos-master state I can find, but this problem will not go away. Would someone be so kind as to a) explain what is happening here and b) suggest remedies?

Thanks very much.

-Paul

    Log file created at: 2015/08/13 10:19:43
    Running on machine: 71.100.14.9
    Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
    I0813 10:19:43.225636 2542 logging.cpp:172] INFO level logging started!
    I0813 10:19:43.235213 2542 main.cpp:181] Build: 2015-05-05 06:15:50 by root
    I0813 10:19:43.235244 2542 main.cpp:183] Version: 0.22.1
    I0813 10:19:43.235257 2542 main.cpp:186] Git tag: 0.22.1
    I0813 10:19:43.235268 2542 main.cpp:190] Git SHA: d6309f92a7f9af3ab61a878403e3d9c284ea87e0
    I0813 10:19:43.245098 2542 leveldb.cpp:176] Opened db in 9.386828ms
    I0813 10:19:43.247138 2542 leveldb.cpp:183] Compacted db in 1.956669ms
    I0813 10:19:43.247194 2542 leveldb.cpp:198] Created db iterator in 13961ns
    I0813 10:19:43.247206 2542
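The symptom Paul describes (a leader elected at 127.0.1.1) is visible in the detector.cpp:481 log line. As a sketch (a hypothetical checker, not a Mesos tool), one could scan a master log for a leader UPID on a loopback address:

```python
import re

# Matches the detector.cpp:481 line, e.g.:
#   detector.cpp:481] A new leading master (UPID=master@127.0.1.1:5050) is detected
LEADER_RE = re.compile(r"new leading master \(UPID=master@([\d.]+):(\d+)\)")

def loopback_leader(log_text):
    """Return 'ip:port' if the elected leader registered with a loopback
    address (the 127.0.1.1 symptom), else None."""
    for line in log_text.splitlines():
        m = LEADER_RE.search(line)
        if m and m.group(1).startswith("127."):
            return "%s:%s" % (m.group(1), m.group(2))
    return None

bad = "I0813 detector.cpp:481] A new leading master (UPID=master@127.0.1.1:5050) is detected"
good = "I0813 detector.cpp:481] A new leading master (UPID=master@192.168.33.1:5053) is detected"
print(loopback_leader(bad))   # -> 127.0.1.1:5050
print(loopback_leader(good))  # -> None
```

A non-None result here points at the /etc/hosts misconfiguration discussed in this thread: the master resolved its own hostname to 127.0.1.1 and advertised that address to ZooKeeper.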
RE: Can't start master properly (stale state issue?); help!
I used to meet a similar issue with ZooKeeper + Mesos; I resolved it by removing 127.0.1.1 from /etc/hosts. Here is an example:

    klaus@klaus-OptiPlex-780:~/Workspace/mesos$ cat /etc/hosts
    127.0.0.1 localhost
    127.0.1.1 klaus-OptiPlex-780   <= remove this line, and add a new line mapping the real IP (e.g. 192.168.1.100) to the hostname

...

BTW, please also clean up the log directory and re-start ZK & Mesos. If any more comments, please let me know.

Regards,
Klaus Ma (马达), PMP® | http://www.cguru.net
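The /etc/hosts pattern Klaus points at can be checked mechanically. A minimal sketch (a hypothetical checker, operating on hosts-file text rather than reading /etc/hosts directly, so it is self-contained):

```python
def hostname_on_loopback(hosts_text, hostname):
    """Flag the Debian/Ubuntu default where the machine's hostname maps to
    127.0.1.1 - the mapping that makes Mesos advertise a loopback address."""
    for line in hosts_text.splitlines():
        line = line.split("#")[0].strip()  # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        if ip == "127.0.1.1" and hostname in names:
            return True
    return False

broken = """127.0.0.1 localhost
127.0.1.1 klaus-OptiPlex-780
"""
fixed = """127.0.0.1 localhost
192.168.1.100 klaus-OptiPlex-780
"""
print(hostname_on_loopback(broken, "klaus-OptiPlex-780"))  # -> True
print(hostname_on_loopback(fixed, "klaus-OptiPlex-780"))   # -> False
```

In practice one would pass in `open("/etc/hosts").read()` and `socket.gethostname()`; a True result means the hostname should be remapped to the host's real IP, as Klaus describes above.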
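haosdent's `netstat -antp | grep 5050` check earlier in the thread can also be automated. A sketch (a hypothetical helper parsing netstat-style output passed in as text, so it runs anywhere; the sample lines are illustrative, not from the thread):

```python
def masters_listening(netstat_output, port=5050):
    """Count distinct processes listening on the master port, given the
    output of `netstat -antp`; more than one means duplicate masters."""
    pids = set()
    for line in netstat_output.splitlines():
        cols = line.split()
        # netstat -antp columns: proto recv-q send-q local foreign state pid/prog
        if len(cols) >= 7 and cols[5] == "LISTEN" and cols[3].endswith(":%d" % port):
            pids.add(cols[6])
    return len(pids)

sample = """tcp 0 0 0.0.0.0:5050 0.0.0.0:* LISTEN 2542/mesos-master
tcp 0 0 127.0.1.1:5050 0.0.0.0:* LISTEN 3001/mesos-master
"""
print(masters_listening(sample))  # -> 2
```

A count above 1 would confirm haosdent's hypothesis that two master processes are racing for leadership on the same machine.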