Josh,

Yeah, ZooKeeper is running on the master, and I can connect to it using zkCli from the slave.

/etc/hosts looks fine:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.240.203.36 shoki.c.accumulo-test.internal shoki  # Added by Google

Hmm, completely baffled!

Arshak


On Tue, Dec 31, 2013 at 6:35 PM, Josh Elser <[email protected]> wrote:

> On 12/31/13, 6:37 PM, Arshak Navruzyan wrote:
>
>> Here is my route -n:
>>
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
>> 10.240.0.1      0.0.0.0         255.255.255.255 UH    0      0        0 eth0
>> 169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
>> 0.0.0.0         10.240.0.1      0.0.0.0         UG    0      0        0 eth0
>>
>> "slave tserver" is another physical machine (well, a Google Compute Engine
>> instance). Yes, one GCE instance is running the master (and a slave) and
>> the other is running just a slave.
>>
>> Here is my config:
>>
>> masters:
>> 10.240.165.43
>>
>> slaves:
>> 10.240.165.43
>> 10.240.203.36
>>
>> BTW, when I run bin/check-slaves conf/slaves:
>>
>> # WRITABLE value not configured, not checking partitions
>> 10.240.165.43
>> 10.240.203.36
>>
>> Is the master supposed to be listed in the slaves file too?
>
> No, your configuration files look correct.
>
> I'm not sure why, but for whatever reason your slave (10.240.203.36) can't
> talk back to the master (10.240.165.43); at least that's where you want to
> look. You know that the master can talk to the slave (otherwise the slave
> tserver would never have started) and that the slave tserver can talk to
> ZooKeeper (it held, and then lost, a lock in ZK). Are you running
> ZooKeeper on the master? (That would further isolate things when debugging
> this.)
>
> It may be worthwhile to double-check your /etc/hosts entries just to be
> safe. Aside from that, I can't think of anything else at the moment.
>
>> On Tue, Dec 31, 2013 at 3:32 PM, Josh Elser <[email protected]> wrote:
>>
>>     Maybe check the output of `route -n` on the master? It might be
>>     something weird with DNS as well.
>>
>>     When you say "slave tserver", are you talking about a separate
>>     physical machine? You have one node running the Accumulo master and
>>     another running a tserver?
>>
>>     On 12/31/13, 6:02 PM, Arshak Navruzyan wrote:
>>
>>         I configured a new instance with a master and a slave tserver.
>>         When I do start-all on the master, the slave doesn't come up. I
>>         am wondering if it's because I left the instance secret as the
>>         default. (I get an exception when I try to change that.)
>>
>>         This is what I see in the master's monitor regarding the slave:
>>
>>         Non-Functioning Tablet Servers
>>         The following tablet servers reported a status other than Online
>>
>>         10.240.203.36:9997    UNRESPONSIVE
>>
>>         In the master log I see the following:
>>
>>         2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>         org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>>         2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
>>         org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>>         2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded class org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for table !0
>>         2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>>         2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing table state for store Root Tablet
>>         org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>>             at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
>>             at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
>>             at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
>>             at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>>
>>         In the slave's tserver.log all I see is:
>>
>>         2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = LOCK_DELETED), exiting.
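
One way to narrow down which direction is failing is to probe the relevant TCP ports directly from each host, outside of Accumulo. The sketch below is not from the thread. It only uses Python's standard `socket` module to attempt a plain TCP connection and report the exception type, so you can tell a refusal (nothing listening) from a timeout (packets likely dropped) or "No route to host" (EHOSTUNREACH, which Java surfaces as `java.net.NoRouteToHostException`). The function name `probe_tcp` is hypothetical.

```python
import socket
from typing import Tuple


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> Tuple[bool, str]:
    """Try to open a plain TCP connection to host:port.

    Returns (True, "ok") if the connection is established, otherwise
    (False, "<ExceptionType>: <message>") so the failure mode is visible:
    ConnectionRefusedError (nothing listening), a timeout (packets dropped),
    or an OSError such as "[Errno 113] No route to host" (EHOSTUNREACH).
    """
    try:
        # create_connection resolves the host and connects; the socket is
        # closed right away because we only care whether the connect works.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "ok"
    except OSError as exc:  # refused, timeout, unreachable and DNS errors are all OSError subclasses
        return False, f"{type(exc).__name__}: {exc}"
```

For example, on the master run `probe_tcp("10.240.203.36", 9997)` (the tserver address from the master log), and on the slave run `probe_tcp("10.240.165.43", 9999)`. The port 9999 is an assumption, namely the default of `master.port.client`; use whatever your accumulo-site.xml sets. If the master-side probe also fails with "No route to host", that points at network or firewall settings between the two instances rather than at Accumulo itself.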
