We just hit the same issue. I attached log snippets from the regionserver and
master to https://issues.apache.org/jira/browse/HBASE-4107

I was able to get the log file out of HDFS. Is there a location I can put it
back in to have it picked up?

Dave

On Fri, Jul 15, 2011 at 12:23 PM, Andy Sautins <[email protected]> wrote:

>
> I don't have the log still. Not sure what I was thinking deleting it. I
> was a little too aggressive wanting to get my fsck back to having 0 corrupt
> blocks.
>
> What you say is interesting. It's more than possible that I'm
> misunderstanding what is going on.
>
> What we saw with the log file is that we could cat it, but couldn't copy
> the file ( would complain about a bad checksum ). I know that's not hard
> data, but going by that, what you say about applying the log up until the
> last sync would make sense. What might have thrown me is that after a
> re-start the logs ( including the corrupt log ) were still in the .logs
> folder. We did a full shutdown/restart and the following stacktrace was in
> the master logs. After this stacktrace hbase continued to start up, however
> the logs ( all logs up until the corrupt log ) for the region with the
> corrupt log file were left in the .logs directory. When we removed the
> corrupt log file and re-started again all the existing logs were removed
> after successful restart as I would expect.
>
> So is it more likely that the error on shutdown is reasonable and that
> the log cleanup just didn't happen on startup? I suppose it makes sense not
> to remove them if there is an error, but it did throw me that the corrupt
> file as well as previous files were still in the .logs directory.
>
> 2011-07-14 18:07:45,954 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting hdfs://hdnn.dfs.returnpath.net:8020/user/hbase/.logs/hd31.dfs.returnpath.net,60020,1309294522164
> org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-8148723766791273697:of:/user/hbase/.logs/hd31.dfs.returnpath.net,60020,1309294522164/hd31.dfs.returnpath.net%3A60020.1310675410770 at 57790464
>     at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
>     at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
>     at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
>     at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
>     at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
>     at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1249)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1899)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1951)
>     at java.io.DataInputStream.read(DataInputStream.java:132)
>     at java.io.DataInputStream.readFully(DataInputStream.java:178)
>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1945)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1845)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1891)
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:198)
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:172)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.parseHLog(HLogSplitter.java:429)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:262)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:188)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:197)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:181)
>     at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:384)
>     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> Sent: Friday, July 15, 2011 12:59 PM
> To: [email protected]
> Subject: Re: corrupt WAL and Java Heap Space...
>
> I'd have expected the log to be recoverable up to the last time you
> called sync. What were you seeing? Do you have the log still? (It
> should recover to the last edit)
>
> St.Ack
>
> On Fri, Jul 15, 2011 at 11:32 AM, Andy Sautins
> <[email protected]> wrote:
> >
> > Thanks. I filed JIRA HBASE-4107 (
> > https://issues.apache.org/jira/browse/HBASE-4107 ).
> >
> > It does seem like the OOME is causing a write to the WAL to be left in
> > an inconsistent state. I haven't had a chance to look yet, but it would
> > seem that the flush isn't atomic, so possibly the data was synced but the
> > checksum wasn't able to be updated. If that logic is right then it would
> > be an issue in the sync to hdfs.
> >
> > In either case it is sad that the log looks like it could get left in an
> > unusable state. That seems like the last thing we'd really want. Not sure
> > about keeping a reservoir of memory around. It seems you could free just
> > about anything to let the write finish and then exit potentially
> > ungracefully. The WAL would need to be recovered, but that's much
> > preferable to data loss.
> >
> > I need to look further but it does feel like the full sync is not atomic
> > and failing somewhere before the checksum is fully written out can
> > potentially lead to WAL corruption. That's a guess. I need to look at it
> > further.
> >
> > Thanks
> >
> > Andy
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> > Sent: Friday, July 15, 2011 10:41 AM
> > To: [email protected]
> > Subject: Re: corrupt WAL and Java Heap Space...
> >
> > Please file an issue. Sounds like an OOME while writing causes us to
> > exit w/o closing the WAL (You think that the case)? My guess is that
> > in this low memory situation, a close might fail anyways (with another
> > OOME) unless we did some extra gymnastics releasing the little
> > reservoir of memory we keep around to release so cleanup succeeds
> > whenever we see OOME.
> >
> > St.Ack
> >
> > On Fri, Jul 15, 2011 at 9:32 AM, Andy Sautins
> > <[email protected]> wrote:
> >>
> >> Yesterday we ran into an interesting issue. We were shutting down our
> >> HBase cluster ( 0.90.1 CDH3u0 ) and in the process one of the nodes
> >> encountered a Java heap space exception. The bummer is the log file was
> >> listed as corrupt from hadoop fsck and was unable to be read when
> >> re-starting the database. We were able to recover in our situation by
> >> removing the corrupt log and did not appear to lose any data.
> >>
> >> Has anyone else seen this issue? If I'm reading the situation right
> >> it looks like a Java heap space error during the WAL checksum write
> >> could leave the WAL corrupt, which doesn't seem like desired behavior.
> >>
> >> I'll look into it further but any thoughts would be appreciated.
> >>
> >>
> >> 2011-07-14 14:54:53,741 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
> >> java.io.IOException: Reflection
> >>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:147)
> >>     at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:987)
> >>     at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:964)
> >> Caused by: java.lang.reflect.InvocationTargetException
> >>     at sun.reflect.GeneratedMethodAccessor1336.invoke(Unknown Source)
> >>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:145)
> >>     ... 2 more
> >> Caused by: java.lang.OutOfMemoryError: Java heap space
> >>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$Packet.<init>(DFSClient.java:2375)
> >>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3271)
> >>     at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
> >>     at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
> >>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3354)
> >>     at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
> >>     at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
> >>     ... 6 more
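
For anyone following the hypothesis in this thread (a record's data reaches the log but its checksum does not, and the reader should still recover everything up to the last good write), here is a toy Java sketch of that idea. It is only an illustration under loud assumptions: the class name `ChecksummedLogSketch`, the `[length][payload][crc32]` record layout, and both methods are made up for this example and are not the HDFS block/checksum format or the HLog/SequenceFile format.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

/** Toy model only: records are [4-byte length][payload][8-byte CRC32]. Hypothetical layout. */
final class ChecksummedLogSketch {

    /** Appends one record. writeChecksum=false simulates dying after the payload was written. */
    static void append(ByteArrayOutputStream out, String payload, boolean writeChecksum) {
        byte[] data = payload.getBytes(StandardCharsets.UTF_8);
        out.write(ByteBuffer.allocate(4).putInt(data.length).array(), 0, 4);
        out.write(data, 0, data.length);
        if (writeChecksum) {
            CRC32 crc = new CRC32();
            crc.update(data, 0, data.length);
            out.write(ByteBuffer.allocate(8).putLong(crc.getValue()).array(), 0, 8);
        }
    }

    /** Returns the records that verify, stopping at the first truncated or mismatching one. */
    static List<String> recover(byte[] log) {
        List<String> good = new ArrayList<>();
        ByteBuffer in = ByteBuffer.wrap(log);
        while (in.remaining() >= 4) {
            int len = in.getInt();
            if (len < 0 || in.remaining() - 8 < len) {
                break; // payload or checksum is missing: give up on the tail
            }
            byte[] data = new byte[len];
            in.get(data);
            long stored = in.getLong();
            CRC32 crc = new CRC32();
            crc.update(data, 0, data.length);
            if (crc.getValue() != stored) {
                break; // checksum mismatch: keep only what verified so far
            }
            good.add(new String(data, StandardCharsets.UTF_8));
        }
        return good;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        append(out, "edit-1", true);
        append(out, "edit-2", true);
        append(out, "edit-3", false); // simulated failure between payload and checksum
        System.out.println(recover(out.toByteArray()));
    }
}
```

In this toy model the third record is dropped and the first two are returned, which is the "recover up to the last sync" behavior Stack describes expecting. Whether the real split path can do the same for a checksum failure in the middle of an HDFS block is exactly the open question in HBASE-4107.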
