I don't know if it helps debugging, but I am seeing the following in tserver_shrine.log

2014-01-01 06:15:37,852 [hdfs.DFSClient] INFO : Exception in createBlockOutputStream 10.240.165.43:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.240.203.36:50010
2014-01-01 06:15:37,852 [hdfs.DFSClient] INFO : Abandoning block blk_-2756969025267118869_1348
2014-01-01 06:15:37,855 [hdfs.DFSClient] INFO : Excluding datanode 10.240.203.36:50010
2014-01-01 06:15:38,147 [hdfs.DFSClient] INFO : Exception in createBlockOutputStream 10.240.165.43:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.240.203.36:50010
2014-01-01 06:15:38,148 [hdfs.DFSClient] INFO : Abandoning block blk_2883724569463729419_1349
2014-01-01 06:15:38,149 [hdfs.DFSClient] INFO : Excluding datanode 10.240.203.36:50010
2014-01-01 06:15:38,554 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:39,559 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:40,565 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:41,571 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:42,578 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:43,586 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:44,594 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)


On Wed, Jan 1, 2014 at 2:28 PM, Josh Elser <[email protected]> wrote:

> Sure -- you have my address already.
>
> Also, nc not working while the tabletserver is dead makes sense (that
> process is what's listening on that port). Once the process dies, there's
> nothing else listening.
>
>
> On 1/1/2014 1:31 PM, Arshak Navruzyan wrote:
>
>> If anyone wants to look at my live environment please let me know (your
>> gmail) and I will add you to the Google Compute Engine. Thanks!
>>
>>
>> On Wed, Jan 1, 2014 at 7:58 AM, Arshak Navruzyan <[email protected]> wrote:
>>
>> Sean
>>
>> Thanks for looking into the log files.
>>
>> These are two Google compute engine instance under the same project
>> so there shouldn't be any firewall between them.
>>
>> For the brief moment that the slave runs during startup, I can nc
>> into port 9997 from the master to the slave. But after it crashes,
>> I can't. Seems like somehow the problem is on the slave.
>>
>> Arshak
>>
>> On Dec 31, 2013 11:58 PM, "Sean Busbey" <[email protected]> wrote:
>>
>> Well, I can tell you the proximal cause. the tserver log shows
>> that it starts normally, then exits because it's told to (via
>> the zookeeper lock being removed).
>>
>> If you look at the master debug logs, this happens because the
>> master fails in three attempts to talk to the tserver, all with
>> the same error:
>>
>> 2014-01-01 06:17:20,231 [master.Master] ERROR: unable to get
>> tablet server status 10.240.203.36:9997[1434c70ed30001b]
>> org.apache.thrift.transport.TTransportException:
>> java.net.NoRouteToHostException: No route to host
>>
>> Unfortunately, this is the same error you noticed in your first
>> email. After 3 of those, the master deletes the zk lock so that
>> the tserver will shutdown.
>>
>> Could there be another firewall blocking access to port 9997 on
>> the worker machine from the master machine?
>>
>> Check from the master (you'll need netcat):
>>
>> $ nc -z 10.240.203.36 9997
>> $ echo $?
>>
>>
>> On Wed, Jan 1, 2014 at 12:33 AM, Arshak Navruzyan
>> <[email protected]> wrote:
>>
>> I am probably missing something really basic so I posted
>> both the master and the slave log files:
>>
>> https://www.dropbox.com/sh/liv1mzuohyiv6uu/X5kx7AZJ6i
>>
>> Thanks again to everyone for the help!
>>
>>
>> On Tue, Dec 31, 2013 at 10:20 PM, Arshak Navruzyan
>> <[email protected]> wrote:
>>
>> disabled selinux (iptables already off) on both master
>> and slave but didn't make a difference unfortunately.
>>
>>
>> On Tue, Dec 31, 2013 at 9:25 PM, Kurt Christensen
>> <[email protected]> wrote:
>>
>> SELINUX disabled? IPTABLES configured? I have
>> nothing else.
>>
>> Kurt
>>
>> ------
>>
>>
>> On 12/31/13 6:02 PM, Arshak Navruzyan wrote:
>>
>> I configured a new instance with a master and a
>> slave tserver. When I do start-all on the
>> master, the slave doesn't come up. I am
>> wondering if it's because I left the instance
>> secret as the default. (I get an exception when
>> I try to change that).
>>
>> This is what I see in the master's monitor
>> regarding the slave
>>
>> Non-Functioning Tablet Servers
>> The following tablet servers reported a
>> status other than Online
>>
>> 10.240.203.36:9997 UNRESPONSIVE
>>
>>
>> In the master log I see the following
>>
>> 2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
>> org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>> 2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
>> org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>> 2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded class org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for table !0
>> 2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
>> 2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing table state for store Root Tablet
>> org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
>>     at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
>>     at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
>>     at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
>>     at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
>>
>>
>> In the slave's tserver.log all I see is
>>
>> 2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = LOCK_DELETED), exiting.
>>
>>
>> --
>>
>> Kurt Christensen
>> P.O. Box 811
>> Westminster, MD 21158-0811
>>
>> ------------------------------------------------------------------------
>>
>> If you can't explain it simply, you don't understand
>> it well enough. -- Albert Einstein
>>
>>
>> --
>> Sean
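
The nc check quoted above only reports success or failure. Here is a minimal Python sketch of the same probe that also separates "connection refused" (host reachable, nothing listening, which is what you would expect once the tserver has exited) from "no route to host" (the error the master logs). The function name check_port and its labels are made up for illustration and are not part of Accumulo; a "no-route" result is often, but not always, a firewall rejecting the packet.

```python
import errno
import socket


def check_port(host, port, timeout=2.0):
    """Attempt one TCP connect and return a short label for the outcome.

    Hypothetical helper for illustration only.
    """
    try:
        # create_connection resolves the host and connects within `timeout` seconds.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host answered, but no process is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (packets possibly dropped).
        return "timeout"
    except OSError as exc:
        # EHOSTUNREACH is what Java reports as NoRouteToHostException.
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "no-route"
        return "error"


# Example, run from the master while the tserver is starting:
# print(check_port("10.240.203.36", 9997))
```

Running it from the master in a loop during start-all should show whether the result changes from "open" to "refused" or to "no-route" around the time the tserver exits.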
