I don't know if it helps with debugging, but I am seeing the following in
tserver_shrine.log:
2014-01-01 06:15:37,852 [hdfs.DFSClient] INFO : Exception in createBlockOutputStream 10.240.165.43:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.240.203.36:50010
2014-01-01 06:15:37,852 [hdfs.DFSClient] INFO : Abandoning block blk_-2756969025267118869_1348
2014-01-01 06:15:37,855 [hdfs.DFSClient] INFO : Excluding datanode 10.240.203.36:50010
2014-01-01 06:15:38,147 [hdfs.DFSClient] INFO : Exception in createBlockOutputStream 10.240.165.43:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.240.203.36:50010
2014-01-01 06:15:38,148 [hdfs.DFSClient] INFO : Abandoning block blk_2883724569463729419_1349
2014-01-01 06:15:38,149 [hdfs.DFSClient] INFO : Excluding datanode 10.240.203.36:50010
2014-01-01 06:15:38,554 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:39,559 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:40,565 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:41,571 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:42,578 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:43,586 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:44,594 [client.ClientServiceHandler] ERROR: ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
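In case it helps anyone skimming a longer copy of this log, here is a rough Python sketch that tallies the two patterns above: which datanodes the DFS client excluded, and how many BAD_CREDENTIALS errors were logged per user. The names `summarize`, `EXCLUDED`, and `BAD_CREDS` are made up for illustration, and the regular expressions are based only on the lines quoted here, so treat them as assumptions about the log format rather than anything Accumulo or Hadoop guarantees.

```python
import re
from collections import Counter

# Patterns inferred from the log excerpt in this message (an assumption,
# not a documented format). \s+ tolerates lines wrapped by the mail client.
EXCLUDED = re.compile(r"Excluding datanode\s+(\d+\.\d+\.\d+\.\d+:\d+)")
BAD_CREDS = re.compile(
    r"ThriftSecurityException\(user:(\w+),\s*code:BAD_CREDENTIALS\)"
)


def summarize(log_text):
    """Count excluded datanodes and BAD_CREDENTIALS errors per user."""
    return {
        "excluded_datanodes": Counter(EXCLUDED.findall(log_text)),
        "bad_credentials": Counter(BAD_CREDS.findall(log_text)),
    }


# A few lines copied from the excerpt above, wrapped as they arrived by mail.
SAMPLE = """\
2014-01-01 06:15:37,855 [hdfs.DFSClient] INFO : Excluding datanode
10.240.203.36:50010
2014-01-01 06:15:38,149 [hdfs.DFSClient] INFO : Excluding datanode
10.240.203.36:50010
2014-01-01 06:15:38,554 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
"""

summary = summarize(SAMPLE)
```

Running `summarize` over the full tserver log would show whether every excluded datanode is the slave's address and whether the credential errors are all for `root`.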
On Wed, Jan 1, 2014 at 2:28 PM, Josh Elser wrote:
Sure -- you have my address already.
Also, nc not working while the tabletserver is dead makes sense
(that process is what's listening on that port). Once the process
dies, there's nothing else listening.
On 1/1/2014 1:31 PM, Arshak Navruzyan wrote:
If anyone wants to look at my live environment, please let me know (your
Gmail address) and I will add you to the Google Compute Engine project. Thanks!
On Wed, Jan 1, 2014 at 7:58 AM, Arshak Navruzyan wrote:
Sean
Thanks for looking into the log files.
These are two Google Compute Engine instances under the same project,
so there shouldn't be any firewall between them.
For the brief moment that the slave runs during startup, I can nc
into port 9997 from the master to the slave. But after it crashes,
I can't. Seems like somehow the problem is on the slave.
Arshak
On Dec 31, 2013 11:58 PM, "Sean Busbey" wrote:
Well, I can tell you the proximal cause. The tserver log shows that it
starts normally, then exits because it's told to (via the ZooKeeper
lock being removed).
If you look at the master debug logs, this happens because the master
fails in three attempts to talk to the tserver, all with the same error:
2014-01-01 06:17:20,231 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434c70ed30001b] org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
Unfortunately, this is the same error you noticed in your first email.
After 3 of those, the master deletes the zk lock so that the tserver
will shut down.
Could there be another firewall blocking access to port 9997 on the
worker machine from the master machine?
Check from the master (you'll need netcat):
$ nc -z 10.240.203.36 9997
$ echo $?
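If netcat isn't installed, roughly the same check can be sketched in Python. This is only an illustration under a few assumptions: `check_port` is a made-up helper name, the 3-second timeout is arbitrary, and the labels it returns are my own. The point is to tell apart "host reachable but nothing listening" (connection refused) from "no route to host", which is the errno I would expect to sit behind the Java NoRouteToHostException in the master log.

```python
import errno
import socket


def check_port(host, port, timeout=3.0):
    """Try a TCP connect to host:port and return a short label for the result."""
    try:
        # create_connection performs the full TCP handshake, like `nc -z`.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout (packets may be dropped).
        return "timeout"
    except ConnectionRefusedError:
        # Host answered, but no process is listening on that port.
        return "refused"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            # The kernel reports the host or network as unreachable.
            return "no route"
        return "error: %s" % exc
```

Calling `check_port("10.240.203.36", 9997)` from the master while the tserver is briefly up, and again after it exits, should show whether the failure is a dead listener or something on the network path.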
On Wed, Jan 1, 2014 at 12:33 AM, Arshak Navruzyan wrote:
I am probably missing something really basic so I posted both the
master and the slave log files:
https://www.dropbox.com/sh/liv1mzuohyiv6uu/X5kx7AZJ6i
Thanks again to everyone for the help!
On Tue, Dec 31, 2013 at 10:20 PM, Arshak Navruzyan wrote:
I disabled SELinux (iptables was already off) on both master and slave,
but unfortunately it didn't make a difference.
On Tue, Dec 31, 2013 at 9:25 PM, Kurt Christensen wrote:
SELINUX disabled? IPTABLES configured? I have nothing else.
Kurt
------
On 12/31/13 6:02 PM, Arshak Navruzyan wrote:
I configured a new instance with a master and a slave tserver. When I
do start-all on the master, the slave doesn't come up. I am wondering
if it's because I left the instance secret as the default. (I get an
exception when I try to change that).
This is what I see in the master's monitor
regarding the slave
Non-Functioning Tablet Servers
The following tablet servers reported a status other than Online
10.240.203.36:9997 UNRESPONSIVE
In the master log I see the following
2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2] org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2] org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded class org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for table !0
2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing table state for store Root Tablet
org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
    at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)
In the slave's tserver.log all I see is
2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = LOCK_DELETED), exiting.
--
Kurt Christensen
P.O. Box 811
Westminster, MD 21158-0811
------------------------------------------------------------------------
If you can't explain it simply, you don't understand it well enough.
-- Albert Einstein
--
Sean