>> Is this new cluster start or master joining an already running cluster (looks
>> like former).

Either way, I get this problem. In particular, these logs were pulled out after I had done a createTable with boundaries (around 100 empty regions per node), shut the cluster down, and then restarted it. A similar thing happened when the master was restarted on a running cluster.

>> Can you fix this? Can you run w/ a working append?

I will run with hadoop append and see what happens.

>> Can you do 'kill -QUIT PID' and see if anything shows in the .out file for
>> hbase?

>> It looks like a master is hung for sure. My guess would be that
>> we're making some presumption based on presence of append. Lets
>> figure it and fix.

Some updates on what happened with the master: 5 hours after the master went to sleep, I saw some activity in the logs (a thread timeout, I suppose). Note that the first line here was the last line in the log snippet I posted in the previous mail.

I checked whether there was any mistake with the gc: I looked at the vmstat/gc logs and I couldn't find any swaps or serious gc happening.

The master has crashed with a number of exceptions:

2011-01-28 07:35:49,877 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer threads to finish
2011-01-28 12:12:55,309 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 16625510ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
2011-01-28 12:12:55,309 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 16629300ms for sessionid 0x32dcb848b98010e, closing socket connection and attempting reconnect
2011-01-28 12:12:55,309 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 16674993ms instead of 60000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
2011-01-28 12:12:55,310 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 16635680ms for sessionid 0x2dcb848b210138, closing socket connection and attempting reconnect
2011-01-28 12:12:55,310 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Split writers finished
2011-01-28 12:12:55,310 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in 16625444 ms for hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
2011-01-28 12:12:55,310 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296185180336 doesn't belong to a known region server, splitting
2011-01-28 12:12:55,315 WARN org.apache.hadoop.hbase.master.LogCleaner: Error while cleaning the logs
java.io.IOException: Call to b3110120.yst.yahoo.net/67.195.46.238:4600 failed on local exception: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/67.195.48.110:56229 remote=b3110120.yst.yahoo.net/67.195.46.238:4600]. 59999 millis timeout left.
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:776)
        at org.apache.hadoop.ipc.Client.call(Client.java:744)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy5.getListing(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy5.getListing(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:606)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:252)
        at org.apache.hadoop.hbase.master.LogCleaner.chore(LogCleaner.java:128)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
        at org.apache.hadoop.hbase.master.LogCleaner.run(LogCleaner.java:167)

And after that a lot of exceptions like the following:

2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110566.yst.yahoo.net,60020,1296199618571 belongs to an existing region server
2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.net,60020,1296183218152 doesn't belong to a known region server, splitting
2011-01-28 12:12:55,464 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.net,60020,1296183218152
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:222)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:613)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:180)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:378)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)
2011-01-28 12:12:55,464 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.net,60020,1296185180085 doesn't belong to a known region server, splitting
2011-01-28 12:12:55,464 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110567.yst.yahoo.net,60020,1296185180085
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:222)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:613)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:180)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:378)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)

Cheers
V

On 1/28/11 8:54 AM, "Stack" <[email protected]> wrote:

On Thu, Jan 27, 2011 at 11:56 PM, Vidhyashankar Venkataraman <[email protected]> wrote:
> 2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110270.yst.yahoo.net,60020,1296199618314 belongs to an existing region server
> 2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615 doesn't belong to a known region server, splitting

Vidhya: This looks like a master that is starting up. Is that right? Is this new cluster start or master joining an already running cluster (looks like former).

> 2011-01-28 07:35:49,867 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
> 2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-0,5,main]: starting
> 2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-1,5,main]: starting
> 2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1: hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266, length=0
> 2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-2,5,main]: starting
> 2011-01-28 07:35:49,867 WARN org.apache.hadoop.hbase.util.FSUtils: Running on HDFS without append enabled may result in data loss

Can you fix this? Can you run w/ a working append?

> 2011-01-28 07:35:49,867 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 might be still open, length is 0
> 2011-01-28 07:35:49,869 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not open hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 for reading. File is empty
> java.io.EOFException
> 2011-01-28 07:35:49,875 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Archived processed log hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 to hdfs://b3110120.yst.yahoo.net:4600/hbase/.oldlogs/b3110271.yst.yahoo.net%3A60020.1296183219266
> 2011-01-28 07:35:49,877 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer threads to finish

Can you do 'kill -QUIT PID' and see if anything shows in the .out file for hbase?

It looks like a master is hung for sure. My guess would be that we're making some presumption based on presence of append. Lets figure it and fix.

Good on you V,
St.Ack
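
For reference, "run with a working append" on this Hadoop 0.20 / HBase 0.90 vintage roughly means running against an append-capable Hadoop build (the exact build in use here is an assumption) and setting the property below in hdfs-site.xml, and making sure HBase sees it too (e.g. via hbase-site.xml or by putting hdfs-site.xml on HBase's classpath). This is a sketch, not what was actually run on this cluster:

  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>

Both HDFS and HBase need a restart to pick it up; the "Running on HDFS without append enabled may result in data loss" warning quoted above is driven by whether HBase sees that flag together with an append-capable hadoop jar, so it should go away once the setting is in effect.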
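
And the 'kill -QUIT PID' suggestion is a request for a thread dump from the master JVM; roughly the following, where the pid-file and .out paths are assumptions based on the default HBASE_PID_DIR=/tmp and the logs directory under the HBase install:

  kill -QUIT `cat /tmp/hbase-${USER}-master.pid`
  less ${HBASE_HOME}/logs/hbase-${USER}-master-*.out

SIGQUIT makes the JVM print a full thread dump to its stdout, which hbase-daemon.sh redirects into the .out file; 'jstack <pid>' gives the same stacks if the .out file stays empty.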
