I also see a number of these warnings in the ZooKeeper logs, which look quite
telling. ZooKeeper is running on the slaves in question, and port 3888 is
unblocked in the firewall:
2012-12-29 07:23:42,492 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Cannot open channel to 1 at election address slave1.analytics-internal.lokistudios.com/10.171.98.247:3888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
at java.net.Socket.connect(Socket.java:546)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
at java.lang.Thread.run(Thread.java:679)
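
One way to double-check the election port from the node that logs the warning is a small TCP probe. This is only a minimal sketch: the function name probe_port and the 3-second timeout are made up for illustration, and it relies on the usual rule of thumb that "Connection refused" means the host answered but nothing accepted the connection on that port (or a firewall actively rejected it), while a silently dropped packet tends to show up as a timeout.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Try a TCP connect and classify the result.

    Returns "open" if the connection succeeds, "refused" if the remote
    end actively refuses it, and "unreachable" for timeouts, DNS
    failures and other socket errors.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host is reachable, but nothing is accepting on this port
        # (or a firewall rule rejects the connection outright).
        return "refused"
    except OSError:
        # Covers socket.timeout, name resolution errors, no route, etc.
        return "unreachable"
```

For example, probe_port("slave1.analytics-internal.lokistudios.com", 3888) run from the machine that logged the warning would be expected to return "refused" while the problem persists, matching the exception above.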
--
Marco Gallotta | Mountain View, California
Software Engineer, Infrastructure | Loki Studios
fb.me/marco.gallotta | twitter.com/marcog
[email protected] | +1 (650) 417-3313
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
On Saturday 29 December 2012 at 1:44 PM, Marco Gallotta wrote:
> Hi there
>
> I've been running an hbase cluster for several months, and it recently
> experienced problems as the nodes reached 95% disk capacity. I added an extra
> node, and now the master keeps crashing with the errors below. I also
> increased the disk capacity on each individual node after this, and the
> errors are the same. I tried removing the new node, and that doesn't help.
>
> There are similar errors in the regionserver and zookeeper logs, but they all
> seem to echo the master logs.
>
> Is there anything I can look at to help diagnose the problem?
>
> hbase-root-master-analytics.log:
> Sat Dec 29 03:14:22 PST 2012 Starting master on analytics
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 59480
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 59480
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
> 2012-12-29 03:14:24,601 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,614 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,622 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,631 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,636 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,643 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,651 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,665 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,675 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2
> 2012-12-29 03:14:24,698 INFO org.apache.hadoop.ipc.HBaseServer: Starting IPC Server listener on 60000
> 2012-12-29 03:14:25,322 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:28,735 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:32,797 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:41,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 2012-12-29 03:14:41,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 3 retries
> 2012-12-29 03:14:41,428 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster
> at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1740)
> at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:146)
> at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:103)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
> at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1754)
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1021)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1049)
> at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:176)
> at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:896)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.createBaseZNodes(ZooKeeperWatcher.java:161)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:154)
> at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:281)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
> at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1735)
> ... 5 more