Here is the log from the unhappy ZooKeeper peer (it was only one server):

2010-10-19 23:58:57,041 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Send worker leaving thread
2010-10-19 23:59:57,010 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection broken:
java.io.IOException: Channel eof
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:630)
2010-10-19 23:59:57,010 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection broken:
java.io.IOException: Channel eof
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:630)
2010-10-19 23:59:57,011 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection broken:
java.io.IOException: Channel eof
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:630)
2010-10-19 23:59:57,012 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection broken:
java.io.IOException: Channel eof
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:630)
2010-10-19 23:59:57,012 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection broken:
java.io.IOException: Channel eof
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:630)
2010-10-19 23:59:57,013 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Interrupted while waiting for message on queue
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1976)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:570)
2010-10-19 23:59:57,013 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Send worker leaving thread
2010-10-19 23:59:57,014 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Exception when using channel: 4
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.send(QuorumCnxManager.java:548)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:578)
2010-10-19 23:59:57,014 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Interrupted while waiting for message on queue
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1976)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:570)
2010-10-19 23:59:57,015 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Send worker leaving thread
2010-10-19 23:59:57,015 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Send worker leaving thread
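
For anyone chasing a similar problem: one rough way to spot a peer that is up but not serving is to poll every ensemble member with ZooKeeper's four-letter commands ("ruok" and "stat"). The Python below is only a sketch under assumptions. The function names are made up for illustration, the host names in the usage comment are placeholders rather than the servers from this cluster, and newer ZooKeeper releases may require these commands to be enabled explicitly.

```python
import socket


def four_letter_word(host, port, cmd="ruok", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            # The server closes the connection once it has replied.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(stat_reply):
    """Return the value of the 'Mode:' line in a 'stat' reply, or None."""
    for line in stat_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def check_ensemble(peers):
    """Map each (host, port) to (answered_imok, mode_or_error_text)."""
    results = {}
    for host, port in peers:
        try:
            ok = four_letter_word(host, port, "ruok").strip() == "imok"
            mode = parse_mode(four_letter_word(host, port, "stat"))
            results[(host, port)] = (ok, mode)
        except OSError as exc:
            # Connection refused, timeout, DNS failure, and so on.
            results[(host, port)] = (False, str(exc))
    return results


# Hypothetical usage (placeholder host names):
# print(check_ensemble([("zk1.example.com", 2181), ("zk2.example.com", 2181)]))
```

A peer that answers "imok" but has no "Mode:" line in its "stat" reply would be the first one I would look at, though that is a heuristic and not a diagnosis.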




On Fri, Oct 22, 2010 at 2:21 PM, Andrew Purtell <[email protected]> wrote:
> I expect this is also at the root of the trouble with the REST server.
>
> You said your ZooKeeper ensemble peer was unhappy? Can we see the logs? Did 
> you report this to the ZK guys?
>
> Best regards,
>
>    - Andy
>
>
> --- On Fri, 10/22/10, Jack Levin <[email protected]> wrote:
>
>> From: Jack Levin <[email protected]>
>> Subject: Re: cold restart/region servers issue
>> To: [email protected]
>> Date: Friday, October 22, 2010, 1:31 PM
>> One of my ZooKeeper servers was unhappy and did not report the
>> /hbase directory. I shut it down, and things started to work much better.
>>
>> -Jack
>>
>> On Fri, Oct 22, 2010 at 10:56 AM, Stack <[email protected]> wrote:
>> > Hmm... does it emit that message once or continuously?  In the log we emit
>> > the ensemble we're trying to contact.  Does it look correct?  When the
>> > machine is having this issue next time, try running the zk cmdline
>> > client and see if you can see a znode at /hbase/master:
>> >
>> > $ ./bin/hbase org.apache.zookeeper.ZooKeeperMain -server HOST:PORT
>> >
>> > Where HOST:PORT is what the RS is reporting for the zk ensemble.
>> >
>> > Once you have the zk cmdline client up, do something like
>> >
>> > ls /hbase
>> >
>> >
>> > ....
>> >
>> >
>> > St.Ack
>> >
>> > On Fri, Oct 22, 2010 at 10:42 AM, Jack Levin <[email protected]> wrote:
>> >> Same ZK all the time; a restart of the regionserver clears the issue.  I
>> >> even see them talking to ZK via tcpdump. Is there a way to enable
>> >> debug log output on ZK to see what might be going on?
>> >>
>> >> -Jack
>> >>
>> >> On Fri, Oct 22, 2010 at 10:28 AM, Stack <[email protected]> wrote:
>> >>> Are they pointed to the same zk ensemble as the other 22 servers? That
>> >>> is, are they running with the same config?  The below complaint is
>> >>> that the regionserver is not seeing master register, perhaps because
>> >>> they are homed at the wrong location in zk or because they are going
>> >>> to a different zk?
>> >>> St.Ack
>> >>>
>> >>> On Fri, Oct 22, 2010 at 8:34 AM, Jack Levin <[email protected]> wrote:
>> >>>> I have 30 region servers. After a cold restart (master, zookeepers, and
>> >>>> all regionservers), 22 regionservers start, but the other 8 have the
>> >>>> following errors.
>> >>>> Any idea how to debug this?  Is zookeeper giving the RS the wrong msg?
>> >>>> Can I log it via tcpdump maybe?
>> >>>>
>> >>>> 2010-10-22 08:32:42,035 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to read master address from ZooKeeper. Retrying. Error was:
>> >>>> java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/master
>> >>>>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:481)
>> >>>>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readMasterAddressOrThrow(ZooKeeperWrapper.java:377)
>> >>>>        at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1289)
>> >>>>        at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1320)
>> >>>>        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:519)
>> >>>>        at java.lang.Thread.run(Thread.java:619)
>> >>>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/master
>> >>>>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>> >>>>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>> >>>>        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
>> >>>>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:477)
>> >>>>        ... 5 more
>> >>>>
>> >>>
>> >>
>> >
>>
>
>
>
>
