Re: slave tserver not responding

Josh Elser Wed, 01 Jan 2014 14:42:04 -0800

The BAD_CREDENTIALS error is just the root password not matching the trace.token.property.password. By default, the configurations set the password for Accumulo's distributed trace mechanism to be "secret".

It's best to make a special user and password for tracing and configure it in accumulo-site.xml. An easy way to get rid of that error is to just set the aforementioned property equal to the root password (and chmod 600 accumulo-site.xml) ;)
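
As a rough sketch of the file-permission step (nothing here ships with Accumulo): the shell function below, under the hypothetical name check_trace_config, restricts the site file to its owner and lists which trace properties it defines. The path under $ACCUMULO_HOME/conf in the usage line is an assumption about the install layout; trace.user is the companion property that names the tracing user.

```shell
# Hypothetical helper, not part of Accumulo: lock down the site file and
# print the names of the trace properties it sets (values are not shown).
check_trace_config() {
    site_xml="$1"
    # The file holds passwords in clear text, so make it owner-only.
    chmod 600 "$site_xml" || return 1
    # Print only the matching property names, one per matching line.
    grep -o -E 'trace\.(user|token\.property\.password)' "$site_xml"
}

# Usage (path is an assumption; adjust to your install):
# check_trace_config "$ACCUMULO_HOME/conf/accumulo-site.xml"
```

If the function prints nothing, neither property is set in the file and the defaults apply.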


On 1/1/14, 3:46 PM, Michael Wall wrote:

I don't know if it helps debugging, but I am seeing the following in
tserver_shrine.log

2014-01-01 06:15:37,852 [hdfs.DFSClient] INFO : Exception in
createBlockOutputStream 10.240.165.43:50010
java.io.IOException: Bad connect ack with firstBadLink as
10.240.203.36:50010
2014-01-01 06:15:37,852 [hdfs.DFSClient] INFO : Abandoning block
blk_-2756969025267118869_1348
2014-01-01 06:15:37,855 [hdfs.DFSClient] INFO : Excluding datanode
10.240.203.36:50010
2014-01-01 06:15:38,147 [hdfs.DFSClient] INFO : Exception in
createBlockOutputStream 10.240.165.43:50010
java.io.IOException: Bad connect ack with firstBadLink as
10.240.203.36:50010
2014-01-01 06:15:38,148 [hdfs.DFSClient] INFO : Abandoning block
blk_2883724569463729419_1349
2014-01-01 06:15:38,149 [hdfs.DFSClient] INFO : Excluding datanode
10.240.203.36:50010
2014-01-01 06:15:38,554 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:39,559 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:40,565 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:41,571 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:42,578 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:43,586 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)
2014-01-01 06:15:44,594 [client.ClientServiceHandler] ERROR:
ThriftSecurityException(user:root, code:BAD_CREDENTIALS)



On Wed, Jan 1, 2014 at 2:28 PM, Josh Elser wrote:

    Sure -- you have my address already.

    Also, nc not working while the tabletserver is dead makes sense
    (that process is what's listening on that port). Once the process
    dies, there's nothing else listening.


    On 1/1/2014 1:31 PM, Arshak Navruzyan wrote:

        If anyone wants to look at my live environment, please let me know
        (your gmail) and I will add you to the Google Compute Engine
        project.  Thanks!


        On Wed, Jan 1, 2014 at 7:58 AM, Arshak Navruzyan wrote:

             Sean

             Thanks for looking into the log files.

             These are two Google Compute Engine instances under the same
             project, so there shouldn't be any firewall between them.

             For the brief moment that the slave runs during startup, I can
             nc into port 9997 from the master to the slave.  But after it
             crashes, I can't.  Seems like somehow the problem is on the
             slave.

             Arshak

             On Dec 31, 2013 11:58 PM, "Sean Busbey" wrote:

                 Well, I can tell you the proximal cause.  The tserver log
                 shows that it starts normally, then exits because it's told
                 to (via the zookeeper lock being removed).

                 If you look at the master debug logs, this happens because
                 the master fails in three attempts to talk to the tserver,
                 all with the same error:

                 2014-01-01 06:17:20,231 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434c70ed30001b]
                 org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host

                 Unfortunately, this is the same error you noticed in your
                 first email. After 3 of those, the master deletes the zk
                 lock so that the tserver will shut down.

                 Could there be another firewall blocking access to port
                 9997 on the worker machine from the master machine?

                 Check from the master (you'll need netcat):

                 $ nc -z 10.240.203.36 9997
                 $ echo $?
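
A small sketch that wraps the check above and spells out the exit status: with -z, nc only tests whether something accepts the connection, and exits 0 if it does. The function name probe_port and the 3-second -w timeout are illustrative choices, not from the thread.

```shell
# Hypothetical wrapper around the nc check: print "open" if something
# accepts a TCP connection on host:port, "closed" otherwise (refused,
# timed out, or no route to host all count as "closed").
probe_port() {
    # -z: connect without sending data; -w 3: give up after 3 seconds.
    if nc -z -w 3 "$1" "$2" >/dev/null 2>&1; then
        echo "open"
    else
        echo "closed"
    fi
}

# Usage, run from the master:
# probe_port 10.240.203.36 9997
```

Running it once while the tserver is starting and again after it exits shows whether the port is ever reachable from the master at all.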





                 On Wed, Jan 1, 2014 at 12:33 AM, Arshak Navruzyan wrote:

                     I am probably missing something really basic, so I
                     posted both the master and the slave log files:

                     https://www.dropbox.com/sh/liv1mzuohyiv6uu/X5kx7AZJ6i

                     Thanks again to everyone for the help!


                     On Tue, Dec 31, 2013 at 10:20 PM, Arshak Navruzyan wrote:

                         Disabled SELinux (iptables already off) on both
                         master and slave, but it didn't make a difference,
                         unfortunately.



                         On Tue, Dec 31, 2013 at 9:25 PM, Kurt Christensen wrote:


                             SELINUX disabled? IPTABLES configured? I have
                             nothing else.

                             Kurt

                             ------


                             On 12/31/13 6:02 PM, Arshak Navruzyan wrote:

                                 I configured a new instance with a master and a
                                 slave tserver.  When I do start-all on the
                                 master, the slave doesn't come up.  I am
                                 wondering if it's because I left the instance
                                 secret as the default. (I get an exception when
                                 I try to change that).

                                 This is what I see in the master's monitor
                                 regarding the slave

                                      Non-Functioning Tablet Servers
                                      The following tablet servers reported a
                                      status other than Online

                                      10.240.203.36:9997  UNRESPONSIVE



                                 In the master log I see the following

                                      2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
                                      org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
                                      2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
                                      org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
                                      2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded class org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for table !0
                                      2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
                                      2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing table state for store Root Tablet
                                      org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
                                              at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
                                              at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
                                              at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
                                              at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)




                                 In the slave's tserver.log all I see is

                                      2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = LOCK_DELETED), exiting.


                             --

                             Kurt Christensen
                             P.O. Box 811
                             Westminster, MD 21158-0811


        
                             ------------------------------------------------------------------------

                             If you can't explain it simply, you don't understand
                             it well enough. -- Albert Einstein






                 --
                 Sean
