Well, that was short-lived: the cluster only stayed up for a couple of hours, and this time all of the region servers crashed, not just one.
Now, after restarting, I've got the master complaining about not having execute permission on "recovered.edits". Where is this file?

Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=mlcamus, access=EXECUTE, inode="recovered.edits":mlcamus:supergroup:rw-r--r--

That message has been repeating for half an hour, with this showing up in one region server:

2010-09-09 04:52:34,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; -ROOT-,,0

I assume this will get better if I change the permissions on some file... but which one? My current guess is sketched below the quoted thread.

-Matthew

On Sep 8, 2010, at 6:21 PM, Matthew LeMieux wrote:

> I tried moving that file to tmp. It appears the master is no longer stuck, but clients are still not able to run queries.
>
> There aren't any messages going by in the log files (just the routine messages I see when the cluster is idle), but attempts to run queries (e.g., count 'table') resulted in NotServingRegionExceptions.
>
> I tried enable 'table', and found that after this command there was a huge amount of activity in the log files, and I was able to run queries again.
>
> There was no previous call to disable 'table', but for some reason HBase wasn't bringing the table's regions online.
>
> I'm not sure what caused the problem, or even whether the actions I took will fix it again in the future, but I am back up and running for now.
>
> FYI,
>
> -Matthew
>
> On Sep 8, 2010, at 6:00 PM, Matthew LeMieux wrote:
>
>> My HBase cluster just crashed. One of the region servers stopped (I don't yet know why). After restarting it, the cluster seemed a bit wobbly, so I decided to shut everything down and restart fresh. I did so (including ZooKeeper and HDFS).
>>
>> Upon restart, I'm getting the following message in the master's log file, repeating continuously with the number of ms waited counting up:
>>
>> 2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 69188ms for lease recovery on hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298 for DFSClient_hb_m_10.104.37.247:60000 on client 10.104.37.247 because current leaseholder is trying to recreate file.
>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1068)
>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>> at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>> at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:396)
>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>>
>> The region servers are waiting, with this being the final message in their log files:
>>
>> 2010-09-09 00:53:49,111 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.104.37.247:60000 that we are up
>>
>> I've been using this version for a little under a week without incident (http://people.apache.org/~jdcryans/hbase-0.89.20100830-candidate-1/).
>>
>> The HDFS comes from CDH3.
>>
>> Does anybody have any ideas on what I can do to get back up and running?
>>
>> Thank you,
>>
>> Matthew
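For reference, here's what I'm planning to try for the recovered.edits problem, on the assumption that it's the per-region directory HBase writes recovered WAL edits into. The <table>/<region> parts below are placeholders, not my actual paths:

    # Locate the offending inode(s). The error shows mode rw-r--r--,
    # i.e. no execute bit, and HDFS needs +x to descend into a directory.
    hadoop fs -lsr /hbase | grep recovered.edits

    # Then add the execute bit on whatever turns up, e.g.:
    hadoop fs -chmod 755 /hbase/<table>/<region>/recovered.edits

If that guess is right, the master should stop logging the AccessControlException on its next retry.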

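And in case anyone else hits the lease-recovery loop from the first message in this thread: the workaround that unstuck the master for me was moving the WAL it was fighting over out of .logs. The path below is a shortened placeholder; use the full path from the FSUtils warning:

    # Move the stuck WAL aside so the master stops waiting on lease recovery.
    # Caveat: any edits still in that log get skipped, so this trades
    # possible data loss for liveness.
    hadoop fs -mv "/hbase/.logs/<regionserver>,60020,<startcode>/<wal-file>" /tmp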