Looking closer at it, I guess the flush and the IOException probably weren't related. So the multi call to delete must have failed back at the client (which is good). It still seems very strange to me that the pattern was always 1) start flushing, 2) fail with "Cannot append; log is closed", and that this happened a handful of times. So we're still trying to find an explanation for this loss of data.
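To spell out what "failed at the client" should look like: here is roughly how a batch of deletes surfaces that kind of failure with the 0.90-era client in cdh3u2. This is just a sketch, not our actual code; the table name and row key are made up, and the exact exception type thrown once retries are exhausted may vary by version.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchDeleteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical table name and row key, for illustration only.
    HTable table = new HTable(conf, "visitor-activities-c7");

    List<Delete> deletes = new ArrayList<Delete>();
    deletes.add(new Delete(Bytes.toBytes("some-row-key")));

    try {
      // The client groups these deletes into a multi RPC per region server.
      // If the server-side HLog append fails ("Cannot append; log is
      // closed"), that IOException is returned for the affected actions.
      table.delete(deletes);
    } catch (RetriesExhaustedWithDetailsException e) {
      // Once retries are exhausted the failed actions come back to the
      // caller, so the deletes are reported as failures, not silently lost.
      for (int i = 0; i < e.getNumExceptions(); i++) {
        System.err.println("Delete failed for row "
            + Bytes.toString(e.getRow(i).getRow()) + ": " + e.getCause(i));
      }
    } finally {
      table.close();
    }
  }
}

As long as the application saw that exception for the multi calls around 12:34, the deletes that hit the closed log were rejected rather than silently dropped, which is what I meant by the failure at the client being "good".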
On Tue, Sep 18, 2012 at 2:37 PM, Bryan Beaudreault <[email protected]> wrote:
> We are running cdh3u2 on a 150 node cluster, where 50 are HBase and 100
> are map reduce. The underlying hdfs spans all nodes.
>
> We just had to restart one of our region servers because it started having
> problems serving data. Looking back on the logs, while the restart was
> happening and the memstores were being flushed we saw a bunch of messages
> like this:
>
>> 12/09/18 12:34:00 DEBUG regionserver.HRegion: Started memstore flush for visitor-activities-c7,\x00\x00\x00\x00n\x8E62608,1345736694990.fe4fe1d88a1b5cbcb9cdb6121e902f76., current region memstore size 5.8m
>> 12/09/18 12:34:00 DEBUG regionserver.HRegion: Finished snapshotting, commencing flushing stores
>> 12/09/18 12:34:00 DEBUG wal.HLog: Moved 34 log files to /hbase/.oldlogs
>> 12/09/18 12:34:00 INFO regionserver.HRegionServer: stopping server at: SERVER_IP,60020,1347074761294
>> 12/09/18 12:34:00 DEBUG catalog.CatalogTracker: Stopping catalog tracker org.apache.hadoop.hbase.catalog.CatalogTracker@1f415056
>> 12/09/18 12:34:00 INFO regionserver.HRegionServer: Waiting on 223 regions to close
>> 12/09/18 12:34:01 ERROR regionserver.HRegionServer:
>> java.io.IOException: Cannot append; log is closed
>>         at org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:912)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.delete(HRegion.java:1317)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.delete(HRegion.java:1237)
>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.delete(HRegionServer.java:1957)
>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:2563)
>>         at sun.reflect.GeneratedMethodAccessor135.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>
> We saw this for a bunch of regions. This may or may not be related (that's
> what I'm trying to figure out), but now we are starting to see evidence
> that some data may not have been persisted. For instance we use the result
> of an increment value as the id for another record, and we are starting to
> see the increment return values that we have already used as ids.
>
> Does this exception mean that we lost data? Is there any way to recover it
> if so? I don't see any hlogs in the .corrupt folder and am not sure how to
> proceed.
>
> One important thing to note is that we were decommissioning 50 of the 150
> datanodes at the time, using the dfs.exclude.hosts setting in the namenode.
> I thought that was supposed to be a safe operation, so hopefully it didn't
> cause this.
>
> Thanks
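For anyone following along, the increment-as-id pattern mentioned above is roughly the following. This is only an illustrative sketch; the table, family, and qualifier names are made up, not our actual schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class IdFromIncrement {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical counter table, for illustration only.
    HTable counters = new HTable(conf, "counters");

    // incrementColumnValue is atomic on the region server; the returned
    // value is the counter after the increment, and that number is used as
    // the id of a new record written elsewhere.
    long newId = counters.incrementColumnValue(
        Bytes.toBytes("visitor-activities"), // row holding the counter
        Bytes.toBytes("c"),                  // column family
        Bytes.toBytes("next_id"),            // qualifier
        1L);                                 // amount to add

    System.out.println("Allocated id " + newId);
    counters.close();
  }
}

The returned id is only as durable as the WAL edit behind the increment, so if an append is lost and the region later reopens from older data, the counter can move backwards and a subsequent call can hand out a value that was already used, which matches what we're seeing.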
