>> Is this new cluster start or master joining an already running cluster (looks
>> like former).

Either way, I get this problem. In particular, these logs were captured after I 
had done a createTable with boundaries (around 100 empty regions per node), shut 
the cluster down, and then restarted it. A similar thing happened when the master 
was restarted on a running cluster.
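
For reference, the pre-split create was done along these lines (a minimal sketch; 
the table name, family name and split keys below are illustrative, not our actual 
schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitCreate {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("test_table");   // illustrative name
    desc.addFamily(new HColumnDescriptor("f1"));                  // illustrative family

    // Each split key becomes a region boundary, so N keys yield N+1 empty regions.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("row-0100"),
        Bytes.toBytes("row-0200"),
        Bytes.toBytes("row-0300"),
    };

    admin.createTable(desc, splits);
  }
}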

>> Can you fix this?  Can you run w/ a working append?
I will run with hadoop append and see what happens.
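
(By "append" I mean the dfs.support.append flag from the 0.20-append branch; a 
quick sanity check of what HBase actually sees could look like this sketch:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CheckAppendFlag {
  public static void main(String[] args) {
    // Loads hbase-site.xml (and any hdfs-site.xml that is on the classpath).
    Configuration conf = HBaseConfiguration.create();
    boolean appendEnabled = conf.getBoolean("dfs.support.append", false);
    System.out.println("dfs.support.append = " + appendEnabled);
  }
}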

>> Can you do 'kill -QUIT PID' and see if anything shows in the .out file for 
>> hbase?
>> It looks like a master is hung for sure.   My guess would be that
>> we're making some presumption based on presence of append.  Lets
>> figure it and fix.

Some updates on what happened with the master:
About 5 hours after the master went quiet, I saw some activity in the logs (a 
thread timeout, I suppose). Note that the first line here is the last line of the 
log snippet I posted in the previous mail. I checked whether GC was at fault: I 
looked at the vmstat/GC logs and couldn't find any swapping or serious GC pauses. 
The master eventually crashed with a number of exceptions:

2011-01-28 07:35:49,877 INFO 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer 
threads to finish
2011-01-28 12:12:55,309 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 
16625510ms instead of 10000ms, this is likely due to a long garbage collecting 
pause and it's usually bad, see 
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
2011-01-28 12:12:55,309 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 16629300ms for sessionid 
0x32dcb848b98010e, closing socket connection and attempting reconnect
2011-01-28 12:12:55,309 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 
16674993ms instead of 60000ms, this is likely due to a long garbage collecting 
pause and it's usually bad, see 
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
2011-01-28 12:12:55,310 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 16635680ms for sessionid 
0x2dcb848b210138, closing socket connection and attempting reconnect
2011-01-28 12:12:55,310 INFO 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Split writers finished
2011-01-28 12:12:55,310 INFO 
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting 
completed in 16625444 ms for 
hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
2011-01-28 12:12:55,310 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder 
hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296185180336
 doesn't belong to a known region server, splitting
2011-01-28 12:12:55,315 WARN org.apache.hadoop.hbase.master.LogCleaner: Error 
while cleaning the logs
java.io.IOException: Call to b3110120.yst.yahoo.net/67.195.46.238:4600 failed on 
local exception: java.io.InterruptedIOException: Interruped while waiting for IO 
on channel java.nio.channels.SocketChannel[connected local=/67.195.48.110:56229 
remote=b3110120.yst.yahoo.net/67.195.46.238:4600]. 59999 millis timeout left.
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:776)
        at org.apache.hadoop.ipc.Client.call(Client.java:744)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy5.getListing(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy5.getListing(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:606)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:252)
        at org.apache.hadoop.hbase.master.LogCleaner.chore(LogCleaner.java:128)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
        at org.apache.hadoop.hbase.master.LogCleaner.run(LogCleaner.java:167)
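
(As an aside, that Sleeper warning is produced by comparing how long the thread 
actually slept against how long it asked to sleep; roughly the pattern below, 
sketched from memory rather than copied from the HBase source:)

public class SleeperSketch {
  private final int periodMs;

  public SleeperSketch(int periodMs) {
    this.periodMs = periodMs;
  }

  public void sleep() throws InterruptedException {
    long start = System.currentTimeMillis();
    Thread.sleep(periodMs);
    long slept = System.currentTimeMillis() - start;
    // If the JVM stalled (long GC, swapping, a suspended VM), the measured time
    // can be far larger than the requested period, hence the warning above.
    if (slept > periodMs * 4L) {   // threshold here is illustrative
      System.err.println("We slept " + slept + "ms instead of " + periodMs + "ms");
    }
  }
}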

And after that, a lot of exceptions like the following:
2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110566.yst.yahoo.ne
t,60020,1296199618571 belongs to an existing region server
2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.ne
t,60020,1296183218152 doesn't belong to a known region server, splitting
2011-01-28 12:12:55,464 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: 
Failed splitting hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.y
ahoo.net,60020,1296183218152
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:222)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:613)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:180)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:378)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)
2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.ne
t,60020,1296185180085 doesn't belong to a known region server, splitting
2011-01-28 12:12:55,464 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: 
Failed splitting hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.y
ahoo.net,60020,1296185180085
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:222)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:613)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:180)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:378)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)


Cheers
V


On 1/28/11 8:54 AM, "Stack" <[email protected]> wrote:

On Thu, Jan 27, 2011 at 11:56 PM, Vidhyashankar Venkataraman
<[email protected]> wrote:
> 2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
> Log folder 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110270.yst.yahoo.net,60020,1296199618314
>  belongs to an existing region server
> 2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
> Log folder 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
>  doesn't belong to a known region server, splitting

Vidhya:

This looks like a master that is starting up.  Is that right? Is this
new cluster start or master joining an already running cluster (looks
like former).


> 2011-01-28 07:35:49,867 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
> 2011-01-28 07:35:49,867 DEBUG 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread 
> Thread[WriterThread-0,5,main]: starting
> 2011-01-28 07:35:49,867 DEBUG 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread 
> Thread[WriterThread-1,5,main]: starting
> 2011-01-28 07:35:49,867 DEBUG 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1: 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266,
>  length=0
> 2011-01-28 07:35:49,867 DEBUG 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread 
> Thread[WriterThread-2,5,main]: starting
> 2011-01-28 07:35:49,867 WARN org.apache.hadoop.hbase.util.FSUtils: Running on 
> HDFS without append enabled may result in data loss


Can you fix this?  Can you run w/ a working append?


> 2011-01-28 07:35:49,867 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266
>  might be still open, length is 0
> 2011-01-28 07:35:49,869 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not open 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266
>  for reading. File is emptyjava.io.EOFException
> 2011-01-28 07:35:49,875 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Archived processed log 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266
>  to 
> hdfs://b3110120.yst.yahoo.net:4600/hbase/.oldlogs/b3110271.yst.yahoo.net%3A60020.1296183219266
> 2011-01-28 07:35:49,877 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split 
> writer threads to finish
>

Can you do 'kill -QUIT PID' and see if anything shows in the .out file
for hbase?

It looks like a master is hung for sure.   My guess would be that
we're making some presumption based on presence of append.  Lets
figure it and fix.

Good on you V,
St.Ack
