>> Is this new cluster start or master joining an already running cluster (looks
>> like former).

Either way, I get this problem. In particular, these logs were captured after I 
had done a createTable with boundaries (around 100 empty regions per node), shut 
the cluster down, and then restarted it. A similar thing happened when the master 
was restarted on a running cluster.
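
For reference, the pre-split create was done along these lines (a minimal sketch; 
the table name, family name and split keys below are illustrative, not our actual 
schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitCreate {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("test_table");   // illustrative name
    desc.addFamily(new HColumnDescriptor("f1"));                  // illustrative family

    // Each split key becomes a region boundary, so N keys yield N+1 empty regions.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("row-0100"),
        Bytes.toBytes("row-0200"),
        Bytes.toBytes("row-0300"),
    };

    admin.createTable(desc, splits);
  }
}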

>> Can you fix this?  Can you run w/ a working append?
I will run with hadoop append and see what happens.
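
(By "append" I mean the dfs.support.append flag from the 0.20-append branch; a 
quick sanity check of what HBase actually sees could look like this sketch:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CheckAppendFlag {
  public static void main(String[] args) {
    // Loads hbase-site.xml (and any hdfs-site.xml that is on the classpath).
    Configuration conf = HBaseConfiguration.create();
    boolean appendEnabled = conf.getBoolean("dfs.support.append", false);
    System.out.println("dfs.support.append = " + appendEnabled);
  }
}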

>> Can you do 'kill -QUIT PID' and see if anything shows in the .out file for 
>> hbase?
>> It looks like a master is hung for sure.   My guess would be that
>> we're making some presumption based on presence of append.  Lets
>> figure it and fix.

Some updates on what happened with the master:
About 5 hours after the master went quiet, I saw some activity in the logs (a 
thread timeout, I suppose). Note that the first line here is the last line of the 
log snippet I posted in the previous mail. I checked whether GC was at fault: I 
looked at the vmstat/GC logs and couldn't find any swapping or serious GC pauses. 
The master eventually crashed with a number of exceptions:

2011-01-28 07:35:49,877 INFO 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer 
threads to finish
2011-01-28 12:12:55,309 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 
16625510ms instead of 10000ms, this is likely due to a long garbage collecting 
pause and it's usually bad, see 
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
2011-01-28 12:12:55,309 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 16629300ms for sessionid 
0x32dcb848b98010e, closing socket connection and attempting reconnect
2011-01-28 12:12:55,309 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 
16674993ms instead of 60000ms, this is likely due to a long garbage collecting 
pause and it's usually bad, see 
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
2011-01-28 12:12:55,310 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 16635680ms for sessionid 
0x2dcb848b210138, closing socket connection and attempting reconnect
2011-01-28 12:12:55,310 INFO 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Split writers finished
2011-01-28 12:12:55,310 INFO 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting 
completed in 16625444 ms for 
hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
2011-01-28 12:12:55,310 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder 
hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296185180336
 doesn't belong to a known region server, splitting
2011-01-28 12:12:55,315 WARN org.apache.hadoop.hbase.master.LogCleaner: Error 
while cleaning the logs
java.io.IOException: Call to b3110120.yst.yahoo.net/67.195.46.238:4600 failed on 
local exception: java.io.InterruptedIOException: Interruped while waiting for IO 
on channel java.nio.channels.SocketChannel[connected local=/67.195.48.110:56229 
remote=b3110120.yst.yahoo.net/67.195.46.238:4600]. 59999 millis timeout left.
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:776)
        at org.apache.hadoop.ipc.Client.call(Client.java:744)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy5.getListing(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy5.getListing(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:606)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:252)
        at org.apache.hadoop.hbase.master.LogCleaner.chore(LogCleaner.java:128)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
        at org.apache.hadoop.hbase.master.LogCleaner.run(LogCleaner.java:167)
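
(As an aside, that Sleeper warning is produced by comparing how long the thread 
actually slept against how long it asked to sleep; roughly the pattern below, 
sketched from memory rather than copied from the HBase source:)

public class SleeperSketch {
  private final int periodMs;

  public SleeperSketch(int periodMs) {
    this.periodMs = periodMs;
  }

  public void sleep() throws InterruptedException {
    long start = System.currentTimeMillis();
    Thread.sleep(periodMs);
    long slept = System.currentTimeMillis() - start;
    // If the JVM stalled (long GC, swapping, a suspended VM), the measured time
    // can be far larger than the requested period, hence the warning above.
    if (slept > periodMs * 4L) {   // threshold here is illustrative
      System.err.println("We slept " + slept + "ms instead of " + periodMs + "ms");
    }
  }
}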

And after that, a lot of exceptions like the following:
2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110566.yst.yahoo.ne
t,60020,1296199618571 belongs to an existing region server
2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.ne
t,60020,1296183218152 doesn't belong to a known region server, splitting
2011-01-28 12:12:55,464 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: 
Failed splitting hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.y
ahoo.net,60020,1296183218152
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:222)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:613)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:180)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:378)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)
2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.ne
t,60020,1296185180085 doesn't belong to a known region server, splitting
2011-01-28 12:12:55,464 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: 
Failed splitting hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.y
ahoo.net,60020,1296185180085
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:222)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:613)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:180)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:378)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)


Cheers
V


On 1/28/11 8:54 AM, "Stack" <[email protected]> wrote:

On Thu, Jan 27, 2011 at 11:56 PM, Vidhyashankar Venkataraman
<[email protected]> wrote:
> 2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
> Log folder 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110270.yst.yahoo.net,60020,1296199618314
>  belongs to an existing region server
> 2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
> Log folder 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
>  doesn't belong to a known region server, splitting

Vidhya:

This looks like a master that is starting up.  Is that right? Is this
new cluster start or master joining an already running cluster (looks
like former).


> 2011-01-28 07:35:49,867 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
> 2011-01-28 07:35:49,867 DEBUG 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread 
> Thread[WriterThread-0,5,main]: starting
> 2011-01-28 07:35:49,867 DEBUG 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread 
> Thread[WriterThread-1,5,main]: starting
> 2011-01-28 07:35:49,867 DEBUG 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1: 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266,
>  length=0
> 2011-01-28 07:35:49,867 DEBUG 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread 
> Thread[WriterThread-2,5,main]: starting
> 2011-01-28 07:35:49,867 WARN org.apache.hadoop.hbase.util.FSUtils: Running on 
> HDFS without append enabled may result in data loss


Can you fix this?  Can you run w/ a working append?


> 2011-01-28 07:35:49,867 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266
>  might be still open, length is 0
> 2011-01-28 07:35:49,869 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not open 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266
>  for reading. File is emptyjava.io.EOFException
> 2011-01-28 07:35:49,875 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Archived processed log 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266
>  to 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.oldlogs/b3110271.yst.yahoo.net%3A60020.1296183219266
> 2011-01-28 07:35:49,877 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split 
> writer threads to finish
>

Can you do 'kill -QUIT PID' and see if anything shows in the .out file
for hbase?

It looks like a master is hung for sure.   My guess would be that
we're making some presumption based on presence of append.  Lets
figure it and fix.

Good on you V,
St.Ack
