Ok -- turned out to be a couple of little things, with one big one :D

The big one -- iptables was still running on the slave :)

I noticed that you were getting the same NoRouteToHostException coming from the datanode logs when they tried to replicate, so I assumed the cause was something outside of Hadoop. A `telnet slave_ip_addr port`, using the address and port that were showing up in the stack trace, verified that I indeed could not connect. iptables had an exception for SSH, so that's why SSH'ing still worked and Arshak could start the processes.
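
A rough sketch of that kind of check, assuming `iptables -S` is available on the slave and you have root there. `input_blocks_traffic` is a made-up helper name. A catch-all REJECT rule with icmp-host-prohibited is a common reason for a client to see "No route to host" even though routing itself is fine.

```shell
# Sketch: detect whether the INPUT chain rejects or drops traffic.
# Reads `iptables -S` style output on stdin (hypothetical helper name).
input_blocks_traffic() {
    grep -Eq '^-A INPUT .*-j (REJECT|DROP)'
}

# Typical use on the slave (needs root, so it is left commented out here):
#   sudo iptables -S | input_blocks_traffic && echo "INPUT chain is filtering traffic"
```

An SSH ACCEPT rule ahead of the REJECT would explain why port 22 works while 9997 does not.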

Small things:

ifconfig showed that IPv6 was still enabled, so I disabled it via procfs and disabled it permanently via sysctl. That would likely have caused more trouble, but I noticed it before I got to iptables.
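
Roughly what that amounts to, as a sketch rather than a transcript. The procfs path and sysctl keys are the standard Linux ones. `ipv6_state` is a hypothetical helper that only reads the flag and changes nothing.

```shell
# Runtime switch via procfs (needs root), shown as comments so nothing changes by accident:
#   echo 1 > /proc/sys/net/ipv6/conf/all/disable_ipv6
#   echo 1 > /proc/sys/net/ipv6/conf/default/disable_ipv6
# Persistent setting, added to /etc/sysctl.conf and loaded with `sysctl -p`:
#   net.ipv6.conf.all.disable_ipv6 = 1
#   net.ipv6.conf.default.disable_ipv6 = 1

# Hypothetical helper: report the current state from the procfs flag (1 means disabled).
# An optional first argument overrides the flag file path.
ipv6_state() {
    flag_file="${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}"
    if [ ! -r "$flag_file" ]; then
        echo unknown
    elif [ "$(cat "$flag_file")" = "1" ]; then
        echo disabled
    else
        echo enabled
    fi
}
```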

Max open files was still at 1024, which was likely to cause you more problems. I raised the limit for the user you run Accumulo as.
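
A sketch of checking the limit and where a persistent change usually lives. The user name `accumulo` and the value 65536 are placeholders picked for illustration, not what is configured on these machines, and `nofile_at_least` is a made-up helper.

```shell
# Persistent limits usually go in /etc/security/limits.conf (read by pam_limits), e.g.:
#   accumulo  soft  nofile  65536
#   accumulo  hard  nofile  65536
# (user name and value are placeholders)

# Hypothetical helper: succeed when the soft open-file limit is at least $1.
nofile_at_least() {
    limit="$(ulimit -Sn)"
    [ "$limit" = "unlimited" ] || [ "$limit" -ge "$1" ]
}

# Report the limit for the current user's shell.
echo "soft nofile limit: $(ulimit -Sn)"
```

Run it as the user that starts Accumulo, in a fresh login, since limits.conf changes only apply to new sessions.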

- Josh

On 1/1/14, 2:28 PM, Josh Elser wrote:
Sure -- you have my address already.

Also, nc not working while the tabletserver is dead makes sense (that
process is what's listening on that port). Once the process dies,
there's nothing else listening.
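
A small wrapper around the same `nc -z` check, as a sketch only; `probe_port` is a hypothetical name. An exit status of 0 means something accepted the connection. A non-zero status covers both "nothing listening" and "blocked by a firewall", so this alone cannot tell those two apart.

```shell
# Hypothetical helper: print "open" if a TCP connect to host:port succeeds, else "closed".
# Relies on netcat's -z mode (connect without sending data).
probe_port() {
    if nc -z "$1" "$2" >/dev/null 2>&1; then
        echo open
    else
        echo closed
    fi
}

# Usage from the master (address and port as they appear in the master log):
#   probe_port 10.240.203.36 9997
```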

On 1/1/2014 1:31 PM, Arshak Navruzyan wrote:
If anyone wants to look at my live environment please let me know (your
gmail) and I will add you to the Google Compute Engine.  Thanks!


On Wed, Jan 1, 2014 at 7:58 AM, Arshak Navruzyan wrote:

    Sean

    Thanks for looking into the log files.

    These are two Google Compute Engine instances under the same project
    so there shouldn't be any firewall between them.

    For the brief moment that the slave runs during startup, I can nc
    into port 9997 from the master to the slave.  But after it crashes,
    I can't.  Seems like somehow the problem is on the slave.

    Arshak

    On Dec 31, 2013 11:58 PM, "Sean Busbey" wrote:

        Well, I can tell you the proximal cause.  The tserver log shows
        that it starts normally, then exits because it's told to (via
        the zookeeper lock being removed).

        If you look at the master debug logs, this happens because the
        master fails in three attempts to talk to the tserver, all with
        the same error:

        2014-01-01 06:17:20,231 [master.Master] ERROR: unable to get
        tablet server status 10.240.203.36:9997[1434c70ed30001b]
        org.apache.thrift.transport.TTransportException:
        java.net.NoRouteToHostException: No route to host

        Unfortunately, this is the same error you noticed in your first
        email. After 3 of those, the master deletes the zk lock so that
        the tserver will shutdown.

        Could there be another firewall blocking access to port 9997 on
        the worker machine from the master machine?

        Check from the master (you'll need netcat):

        $ nc -z 10.240.203.36 9997
        $ echo $?





        On Wed, Jan 1, 2014 at 12:33 AM, Arshak Navruzyan wrote:

            I am probably missing something really basic so I posted
            both the master and the slave log files:

            https://www.dropbox.com/sh/liv1mzuohyiv6uu/X5kx7AZJ6i

            Thanks again to everyone for the help!


            On Tue, Dec 31, 2013 at 10:20 PM, Arshak Navruzyan wrote:

                disabled selinux (iptables already off) on both master
                and slave but didn't make a difference unfortunately.



                On Tue, Dec 31, 2013 at 9:25 PM, Kurt Christensen wrote:


                    SELINUX disabled? IPTABLES configured? I have
                    nothing else.

                    Kurt

                    ------


                    On 12/31/13 6:02 PM, Arshak Navruzyan wrote:

                        I configured a new instance with a master and a
                        slave tserver.  When I do start-all on the
                        master, the slave doesn't come up.  I am
                        wondering if it's because I left the instance
                        secret as the default. (I get an exception when
                        I try to change that).

                        This is what I see in the master's monitor
                        regarding the slave

                             Non-Functioning Tablet Servers
                             The following tablet servers reported a
                        status other than Online

                        10.240.203.36:9997  UNRESPONSIVE



                        In the master log I see the following

                             2013-12-31 22:56:13,665 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
                             org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
                             2013-12-31 22:56:13,712 [master.Master] ERROR: unable to get tablet server status 10.240.203.36:9997[1434a79d34404a2]
                             org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
                             2013-12-31 22:56:13,802 [balancer.TableLoadBalancer] INFO : Loaded class org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for table !0
                             2013-12-31 22:56:13,803 [master.Master] INFO : Assigning 1 tablets
                             2013-12-31 22:56:13,812 [master.Master] ERROR: Error processing table state for store Root Tablet
                             org.apache.thrift.transport.TTransportException: java.net.NoRouteToHostException: No route to host
                                 at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:475)
                                 at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:464)
                                 at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:441)
                                 at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransportWithDefaultTimeout(ThriftTransportPool.java:366)




                        In the slave's tserver.log all I see is

                             2013-12-31 22:56:34,731 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = LOCK_DELETED), exiting.


                    --

                    Kurt Christensen
                    P.O. Box 811
                    Westminster, MD 21158-0811


                    ------------------------------------------------------------------------

                    If you can't explain it simply, you don't understand
                    it well enough. -- Albert Einstein






        --
        Sean

