Hi all, I've had a Hadoop system with HBase working for quite a long time now. We have hadoop-hbase-master-0.90.6+84.73-1 installed on Red Hat 5, with four regionservers on slave nodes and the REST and Thrift servers running on the master. Just today, pretty much without warning, the master crashed, and now we can't restart it: it starts and then dies almost immediately. No error message appears in the log, and the process seems to clean itself up normally on exit. The log contains only:

2013-08-02T14:34:40.142-0400: [GC [ParNew: 17024K->1334K(19136K), 0.0052490 secs] 17024K->1334K(83008K), 0.0053100 secs] [Times: user=0.02 sys=0.01, real=0.01 secs]
2013-08-02T14:34:40.347-0400: [GC [1 CMS-initial-mark: 0K(63872K)] 9036K(83008K), 0.0071700 secs] [Times: user=0.02 sys=0.01, real=0.01 secs]
2013-08-02T14:34:40.471-0400: [GC [ParNew: 18358K->1234K(19136K), 0.0265690 secs] 18358K->2644K(83008K), 0.0266550 secs] [Times: user=0.12 sys=0.00, real=0.03 secs]
2013-08-02T14:34:40.630-0400: [CMS-concurrent-mark: 0.013/0.275 secs] [Times: user=0.53 sys=0.01, real=0.27 secs]
2013-08-02T14:34:40.645-0400: [CMS-concurrent-preclean: 0.014/0.015 secs] [Times: user=0.01 sys=0.00, real=0.02 secs]
2013-08-02T14:34:40.645-0400: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2013-08-02T14:34:40.645-0400: [GC[YG occupancy: 7584 K (19136 K)][Rescan (parallel) , 0.0030240 secs][weak refs processing, 0.0000090 secs] [1 CMS-remark: 1410K(63872K)] 8994K(83008K), 0.0031230 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2013-08-02T14:34:40.649-0400: [CMS-concurrent-sweep: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2013-08-02T14:34:40.726-0400: [CMS-concurrent-reset: 0.077/0.077 secs] [Times: user=0.02 sys=0.05, real=0.08 secs]
Heap
 par new generation total 19136K, used 7584K [0x00002b7281fe0000, 0x00002b72834a0000, 0x00002b72957e0000)
  eden space 17024K, 37% used [0x00002b7281fe0000, 0x00002b7282613928, 0x00002b7283080000)
  from space 2112K, 58% used [0x00002b7283080000, 0x00002b72831b4838, 0x00002b7283290000)
  to space 2112K, 0% used [0x00002b7283290000, 0x00002b7283290000, 0x00002b72834a0000)
 concurrent mark-sweep generation total 63872K, used 1410K [0x00002b72957e0000, 0x00002b7299640000, 0x00002b7475fe0000)
 concurrent-mark-sweep perm gen total 26256K, used 15758K [0x00002b7475fe0000, 0x00002b7477984000, 0x00002b747b3e0000)

And if I restart I get essentially the exact same log overwriting this one (with
new timestamps of course). The REST server, the Thrift server, and all the regionservers appear fine. There are no issues with disk space or other resources on the server box, and HDFS appears fine. Any advice on other places I can look for more data, or on how I might get more granularity in the logs? Or does someone see an error I'm missing in what is already being logged?

Thanks in advance,
Trevor Antczak
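
P.S. In case it helps anyone reading the GC output above: a rough sketch of how the stop-the-world pauses in a log like this could be totalled, to judge whether garbage collection is even a plausible suspect. It assumes one GC event per line, as in the file on disk, and treats any line containing "[GC" as a pause while skipping the CMS-concurrent phases. The names summarize_gc and sample are made up for illustration and are not part of HBase or the JVM.

```python
import re

# Wall-clock time that HotSpot prints at the end of each GC event.
REAL_TIME = re.compile(r"real=(\d+\.\d+) secs")


def summarize_gc(lines):
    """Return (pause_count, total_pause_secs, longest_pause_secs)."""
    pauses = []
    for line in lines:
        # CMS-concurrent-* phases run alongside the application,
        # so only lines tagged "[GC" are counted as pauses here.
        if "[GC" not in line:
            continue
        match = REAL_TIME.search(line)
        if match:
            pauses.append(float(match.group(1)))
    return len(pauses), sum(pauses), max(pauses, default=0.0)


# Three events copied from the log in this message.
sample = [
    "2013-08-02T14:34:40.142-0400: [GC [ParNew: 17024K->1334K(19136K), 0.0052490 secs] "
    "17024K->1334K(83008K), 0.0053100 secs] [Times: user=0.02 sys=0.01, real=0.01 secs]",
    "2013-08-02T14:34:40.630-0400: [CMS-concurrent-mark: 0.013/0.275 secs] "
    "[Times: user=0.53 sys=0.01, real=0.27 secs]",
    "2013-08-02T14:34:40.471-0400: [GC [ParNew: 18358K->1234K(19136K), 0.0265690 secs] "
    "18358K->2644K(83008K), 0.0266550 secs] [Times: user=0.12 sys=0.00, real=0.03 secs]",
]

count, total, longest = summarize_gc(sample)
print(f"{count} pauses, {total:.2f}s total, {longest:.2f}s longest")
```

Pauses of a few hundredths of a second on an 83 MB heap would suggest this file is only GC output, and that the actual failure is recorded somewhere else.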
