Re: Unresponsive master in Hbase 0.90.0

Stack Fri, 28 Jan 2011 10:49:55 -0800

On Fri, Jan 28, 2011 at 10:40 AM, Vidhyashankar Venkataraman
<[email protected]> wrote:
>>> Is this new cluster start or master joining an already running cluster 
>>> (looks
>>> like former).
>
> Either way, I get this problem. In particular, these logs were pulled out 
> after I had done a createTable with boundaries (around 100 empty regions per 
> node) and shut it down and then restarted. A similar thing happened when 
> master was restarted on a running cluster.
>


If you move aside
hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615,
does it come up then?




>>> Can you fix this?  Can you run w/ a working append?
> I will run with hadoop append and see what happens.
>

Well, we should figure this issue first if you don't mind but yeah,
you should be up on append.


>>> Can you do 'kill -QUIT PID' and see if anything shows in the .out file for 
>>> hbase?
>>> It looks like a master is hung for sure.   My guess would be that
>>> we're making some presumption based on presence of append.  Lets
>>>figure it and fix.
>
> Some updates on the what happened with the master:
> 5 hours after the master was sleeping, I saw some activity in the logs (a 
> thread timeout I suppose): Note the first line here was the last line in the 
> log snippet I posted in the previous mail. I checked whether there was any 
> mistake with the gc: I looked at the vmstat/gc logs and I couldn't find any 
> swaps/ serious gc happening. The master has crashed with a number
>
> 2011-01-28 07:35:49,877 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split 
> writer threads to finish
> 2011-01-28 12:12:55,309 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 
> 16625510ms instead of 10000ms, this is likely due to a long garbage 
> collecting pause and it's usually bad, see 
> http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9


Thats an impressive sleep time V.


> 2011-01-28 12:12:55,309 INFO org.apache.zookeeper.ClientCnxn: Client session 
> timed out, have not heard from server in 16629300ms for sessionid 
> 0x32dcb848b98010e, closing socket connection and attempting reconnect
> 2011-01-28 12:12:55,309 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 
> 16674993ms instead of 60000ms, this is likely due to a long garbage 
> collecting pause and it's usually bad, see 
> http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> 2011-01-28 12:12:55,310 INFO org.apache.zookeeper.ClientCnxn: Client session 
> timed out, have not heard from server in 16635680ms for sessionid 
> 0x2dcb848b210138, closing socket connection and attempting reconnect
> 2011-01-28 12:12:55,310 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Split writers finished
> 2011-01-28 12:12:55,310 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting 
> completed in 16625444 ms for 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615


So, it 'completed'.

I wonder what it was doing over that time period.  If you move aside
the above dir, does it come up?  If you put it back to you repro the
above?  If you repro, can you do the kill -QUIT PID a few times.  It'd
be real interesting to see what its up to.


> 2011-01-28 12:12:55,310 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
> Log folder 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296185180336
>  doesn't belong to a known region server, splitting
> 2011-01-28 12:12:55,315 WARN org.apache.hadoop.hbase.master.LogCleaner: Error 
> while cleaning the logsjava.io.IOException: Call to 
> b3110120.yst.yahoo.net/67.195.46.238:4600 failed on local exception: 
> java.io.InterruptedIOException: Interruped while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/67.195.48.110:56229 
> remote=b3110120.yst.yahoo.net/67.195.46.238:4600]. 59999 millis timeout left.
>        at org.apache.hadoop.ipc.Client.wrapException(Client.java:776)
>        at org.apache.hadoop.ipc.Client.call(Client.java:744)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>        at $Proxy5.getListing(Unknown Source)
>        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
>        at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>        at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>        at $Proxy5.getListing(Unknown Source)
>        at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:606)
>        at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:252)
>        at org.apache.hadoop.hbase.master.LogCleaner.chore(LogCleaner.java:128)
>        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>        at org.apache.hadoop.hbase.master.LogCleaner.run(LogCleaner.java:167)
>
> And after that a lot of exceptions like the following:
> 2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
> Log folder 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110566.yst.yahoo.ne
> t,60020,1296199618571 belongs to an existing region server
> 2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
> Log folder 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.ne
> t,60020,1296183218152 doesn't belong to a known region server, splitting
> 2011-01-28 12:12:55,464 ERROR 
> org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.y
> ahoo.net,60020,1296183218152
> java.io.IOException: Filesystem closed
>        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:222)
>        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:613)    
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)        
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
>        at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
>         at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:180)
>        at 
> org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:378)
>        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)
> 2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
> Log folder 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.ne
> t,60020,1296185180085 doesn't belong to a known region server, splitting
> 2011-01-28 12:12:55,464 ERROR 
> org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.y
> ahoo.net,60020,1296185180085
> java.io.IOException: Filesystem closed
>        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:222)
>        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:613)
>        at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)        
> at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
>        at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
>         at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:180)
>        at 
> org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:378)
>        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)
>
>


Hmm.. maybe before you restart remove the directory
hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/ completely so no files
to be processed on restart.

I wonder if there is some incompatiblity such that we can't process
WALs written with older hbase (I presume above WALs written with a
version of hbase before 0.90?).

St.Ack



> Cheers
> V
>
>
> On 1/28/11 8:54 AM, "Stack" <[email protected]> wrote:
>
> On Thu, Jan 27, 2011 at 11:56 PM, Vidhyashankar Venkataraman
> <[email protected]> wrote:
>> 2011-01-28 07:35:49,866 INFO 
>> org.apache.hadoop.hbase.master.MasterFileSystem: Log folder 
>> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110270.yst.yahoo.net,60020,1296199618314
>>  belongs to an existing region server
>> 2011-01-28 07:35:49,866 INFO 
>> org.apache.hadoop.hbase.master.MasterFileSystem: Log folder 
>> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
>>  doesn't belong to a known region server, splitting
>
> Vidhya:
>
> This looks like a master that is starting up.  Is that right? Is this
> new cluster start or master joining an already running cluster (looks
> like former).
>
>
>> 2011-01-28 07:35:49,867 INFO 
>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) 
>> in 
>> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
>> 2011-01-28 07:35:49,867 DEBUG 
>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread 
>> Thread[WriterThread-0,5,main]: starting
>> 2011-01-28 07:35:49,867 DEBUG 
>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread 
>> Thread[WriterThread-1,5,main]: starting
>> 2011-01-28 07:35:49,867 DEBUG 
>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 
>> 1: 
>> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266,
>>  length=0
>> 2011-01-28 07:35:49,867 DEBUG 
>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread 
>> Thread[WriterThread-2,5,main]: starting
>> 2011-01-28 07:35:49,867 WARN org.apache.hadoop.hbase.util.FSUtils: Running 
>> on HDFS without append enabled may result in data loss
>
>
> Can you fix this?  Can you run w/ a working append?
>
>
>> 2011-01-28 07:35:49,867 WARN 
>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File 
>> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266
>>  might be still open, length is 0
>> 2011-01-28 07:35:49,869 WARN 
>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not open 
>> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266
>>  for reading. File is emptyjava.io.EOFException
>> 2011-01-28 07:35:49,875 INFO 
>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Archived processed 
>> log 
>> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266
>>  to 
>> hdfs://b3110120.yst.yahoo.net:4600/hbase/.oldlogs/b3110271.yst.yahoo.net%3A60020.1296183219266
>> 2011-01-28 07:35:49,877 INFO 
>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split 
>> writer threads to finish
>>
>
> Can you do 'kill -QUIT PID' and see if anything shows in the .out file
> for hbase?
>
> It looks like a master is hung for sure.   My guess would be that
> we're making some presumption based on presence of append.  Lets
> figure it and fix.
>
> Good on you V,
> St.Ack
>
>

Re: Unresponsive master in Hbase 0.90.0

Reply via email to