So it seems the issue also shows up just after a log roll? So we no longer have the old WAL file, and that write op still tries to write to the old file? You can confirm this from the WAL file path name.
-Anoop-

On Wed, Mar 23, 2016 at 6:14 PM, Pankaj kr <[email protected]> wrote:
> Thanks Anoop for replying.
>
> No explicit close op happened on the WAL file (this log was rolled a few seconds before). As per the HDFS log, there is no close call to this WAL file.
>
> The same issue happened again on 19th March.
>
> Here the WAL was rolled just before the issue happened:
> 2016-03-19 05:38:07,153 | INFO | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | Rolled WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337083824 with entries=6508, filesize=61.03 MB; new WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337087136 | org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:972)
>
> And a few seconds later, during the sync op:
> 2016-03-19 05:38:10,075 | ERROR | sync.1 | Error syncing, request close of wal | org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> java.nio.channels.ClosedChannelException
>     at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
>     at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
>     at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:545)
>     at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
>     at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>     at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
>     at java.lang.Thread.run(Thread.java:745)
> 2016-03-19 05:38:10,076 | INFO | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | Rolled WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337087136 with entries=6383, filesize=61.51 MB; new WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337090049 | org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:972)
> 2016-03-19 05:38:10,087 | FATAL | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | ABORTING region server RS-HOSTNAME,21302,1458301420876: IOE in log roller | org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2055)
> java.nio.channels.ClosedChannelException
>     at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
>     at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
>     at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:545)
>     at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
>     at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>     at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
>     at java.lang.Thread.run(Thread.java:745)
> 2016-03-19 05:38:10,088 | FATAL | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.index.coprocessor.regionserver.IndexRegionObserver, org.apache.hadoop.hbase.JMXListener, org.apache.hadoop.hbase.index.coprocessor.wal.IndexWALObserver] | org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2063)
>
> Here also, no error details in the DN/NN logs.
>
> I am still checking this; will update if there are any findings.
>
> Regards,
> Pankaj
>
> -----Original Message-----
> From: Anoop John [mailto:[email protected]]
> Sent: Wednesday, March 23, 2016 3:50 PM
> To: [email protected]
> Subject: Re: Region server getting aborted in every one or two days
>
> At the same time, did any explicit close op happen on the WAL file? Any log rolling?
> Can you check the logs to know this? Maybe also check the HDFS logs for the close calls to the WAL file?
>
> -Anoop-
>
> On Wed, Mar 23, 2016 at 12:10 PM, Pankaj kr <[email protected]> wrote:
>> Hi,
>>
>> In our production environment, the RS is getting aborted every one or two days with the following exception.
>>
>> 2016-03-16 13:57:07,975 | FATAL | MemStoreFlusher.0 | ABORTING region server xyz-vm8,24502,1458034278600: Replay of WAL required. Forcing server shutdown | org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2055)
>> org.apache.hadoop.hbase.DroppedSnapshotException: region: TB_WEBLOGIN_201603,060,1457916997964.06e204d3bc262b72820aa195fec23513.
>>     at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2423)
>>     at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2128)
>>     at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2090)
>>     at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1983)
>>     at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1909)
>>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:509)
>>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:470)
>>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:74)
>>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
>>     at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.nio.channels.ClosedChannelException
>>     at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
>>     at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
>>     at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:635)
>>     at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
>>     at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>>     at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
>>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
>>     ... 1 more
>>
>> I don't see any error info on the HDFS side at that point in time.
>> Has anyone faced this issue?
>>
>> HBase version is 0.98.6.
>>
>> Regards,
>> Pankaj
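The suspected race in this thread (a sync thread flushing a WAL writer that the log roller has already closed) can be reduced to a minimal sketch. This is illustrative code, not HBase internals: the class name, the temp file, and the use of a plain `FileChannel` in place of an HDFS `DFSOutputStream` are all assumptions made for the demo, but the resulting exception type matches the one in the logs above.

```java
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical reproduction of the race: one actor ("logRoller") closes
// the current WAL output channel while another actor ("sync.1") still
// holds a reference to it and tries to write/flush, which surfaces as
// java.nio.channels.ClosedChannelException.
public class WalRollRaceSketch {
    public static void main(String[] args) throws Exception {
        Path wal = Files.createTempFile("wal-", ".log");
        FileChannel oldWriter = FileChannel.open(wal, StandardOpenOption.WRITE);

        // logRoller: rolls the WAL, closing the old writer.
        oldWriter.close();

        // sync.1: still syncing against the stale writer reference.
        try {
            oldWriter.write(ByteBuffer.wrap("edit".getBytes()));
            oldWriter.force(true); // roughly analogous to hflush()
        } catch (ClosedChannelException e) {
            System.out.println("sync failed: " + e);
        } finally {
            Files.deleteIfExists(wal);
        }
    }
}
```

If this is indeed what happens in the region server, the interesting question is why the sync runner still holds the old writer after `replaceWriter` completed, i.e. whether the handoff of the new writer to the sync threads is properly ordered with the close of the old one.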
