We are running CDH3u2 on a 150-node cluster, where 50 nodes run HBase and 100 run MapReduce. The underlying HDFS spans all 150 nodes.
We just had to restart one of our region servers because it started having problems serving data. Looking back on the logs, while the restart was happening and the memstores were being flushed, we saw a bunch of messages like this:

12/09/18 12:34:00 DEBUG regionserver.HRegion: Started memstore flush for visitor-activities-c7,\x00\x00\x00\x00n\x8E62608,1345736694990.fe4fe1d88a1b5cbcb9cdb6121e902f76., current region memstore size 5.8m
12/09/18 12:34:00 DEBUG regionserver.HRegion: Finished snapshotting, commencing flushing stores
12/09/18 12:34:00 DEBUG wal.HLog: Moved 34 log files to /hbase/.oldlogs
12/09/18 12:34:00 INFO regionserver.HRegionServer: stopping server at: SERVER_IP,60020,1347074761294
12/09/18 12:34:00 DEBUG catalog.CatalogTracker: Stopping catalog tracker org.apache.hadoop.hbase.catalog.CatalogTracker@1f415056
12/09/18 12:34:00 INFO regionserver.HRegionServer: Waiting on 223 regions to close
12/09/18 12:34:01 ERROR regionserver.HRegionServer:
java.io.IOException: Cannot append; log is closed
        at org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:912)
        at org.apache.hadoop.hbase.regionserver.HRegion.delete(HRegion.java:1317)
        at org.apache.hadoop.hbase.regionserver.HRegion.delete(HRegion.java:1237)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.delete(HRegionServer.java:1957)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:2563)
        at sun.reflect.GeneratedMethodAccessor135.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)

We saw this for a bunch of regions. This may or may not be related (that's what I'm trying to figure out), but we are now starting to see evidence that some data may not have been persisted. For instance, we use the result of an increment as the id for another record, and we are starting to see the increment return values that we have already used as ids (a simplified sketch of that pattern is at the end of this message).

Does this exception mean that we lost data? If so, is there any way to recover it? I don't see any hlogs in the .corrupt folder and am not sure how to proceed.

One important thing to note is that we were decommissioning 50 of the 150 datanodes at the time, using the dfs.hosts.exclude setting on the namenode. I thought that was supposed to be a safe operation, so hopefully it didn't cause this.

Thanks
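P.S. In case it helps, here is a stripped-down sketch of the increment-as-id pattern I described above. The table, family, and qualifier names are made up for illustration, not our real schema:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class IncrementAsId {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable counters = new HTable(conf, "counters");   // placeholder table name
        HTable records  = new HTable(conf, "records");    // placeholder table name

        // Atomically bump a counter cell; the returned value is the new count.
        // writeToWAL = true, so the increment should also be appended to the HLog.
        long id = counters.incrementColumnValue(
                Bytes.toBytes("activity-id"),   // row
                Bytes.toBytes("c"),             // family
                Bytes.toBytes("next"),          // qualifier
                1L,                             // amount
                true);                          // writeToWAL

        // We then use the returned value as the row key of a new record, which is
        // why a lost increment shows up on our side as a duplicate id.
        Put put = new Put(Bytes.toBytes(id));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("..."));
        records.put(put);

        counters.close();
        records.close();
    }
}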
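And for completeness, the decommission was the standard exclude-file procedure on the namenode, roughly like this (the file path is just an example):

  <!-- hdfs-site.xml on the namenode -->
  <property>
    <name>dfs.hosts.exclude</name>
    <value>/etc/hadoop/conf/hosts.exclude</value>
  </property>

We listed the 50 datanodes in that file and then ran:

  hadoop dfsadmin -refreshNodes

and waited for the nodes to report "Decommission in progress" in the namenode UI.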
