Thanks for replying, Anoop.
No explicit close operation happened on the WAL file (the log had been rolled
a few seconds earlier). As per the HDFS log, there was no close call on this WAL file.
The same issue happened again on 19th March.
Here the WAL was rolled just before the issue happened:
2016-03-19 05:38:07,153 | INFO |
regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | Rolled WAL
/hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337083824
with entries=6508, filesize=61.03 MB; new WAL
/hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337087136
|
org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:972)
And a few seconds later, during the sync operation:
2016-03-19 05:38:10,075 | ERROR | sync.1 | Error syncing, request close of wal
|
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
java.nio.channels.ClosedChannelException
at
org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
at
org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
at
org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:545)
at
org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
at
org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
at
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
at
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
at java.lang.Thread.run(Thread.java:745)
2016-03-19 05:38:10,076 | INFO |
regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | Rolled WAL
/hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337087136
with entries=6383, filesize=61.51 MB; new WAL
/hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337090049
|
org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:972)
2016-03-19 05:38:10,087 | FATAL |
regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | ABORTING region server
RS-HOSTNAME,21302,1458301420876: IOE in log roller |
org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2055)
java.nio.channels.ClosedChannelException
at
org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
at
org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
at
org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:545)
at
org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
at
org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
at
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
at
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
at java.lang.Thread.run(Thread.java:745)
2016-03-19 05:38:10,088 | FATAL |
regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | RegionServer abort: loaded
coprocessors are:
[org.apache.hadoop.hbase.index.coprocessor.regionserver.IndexRegionObserver,
org.apache.hadoop.hbase.JMXListener,
org.apache.hadoop.hbase.index.coprocessor.wal.IndexWALObserver] |
org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2063)
Here too, there are no error details in the DataNode/NameNode logs.
I am still checking this and will update if I find anything.
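To check whether other occurrences follow the same pattern (a WAL roll followed a few seconds later by a sync error), the RS log can be scanned with something like the rough sketch below. It only assumes the "timestamp | LEVEL |" prefix and the "Rolled WAL" / "Error syncing" message texts seen in the excerpts above; the function names are hypothetical and this is not part of any HBase tooling.

```python
import re
from datetime import datetime

# Entry prefix as seen in the excerpts above: "2016-03-19 05:38:07,153 | INFO |"
ENTRY_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \| (\w+) \|"
)


def parse_entries(lines):
    """Group raw log lines into (timestamp, level, text) entries.

    A line starting with a timestamp opens a new entry; any other line
    (wrapped message text, stack trace frames) is appended to the
    entry currently being built.
    """
    entries = []
    for line in lines:
        match = ENTRY_START.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            entries.append([ts, match.group(2), line.strip()])
        elif entries:
            entries[-1][2] += " " + line.strip()
    return [tuple(entry) for entry in entries]


def roll_to_sync_error_gaps(entries):
    """For each 'Error syncing' entry, return (timestamp, seconds since
    the most recent 'Rolled WAL' entry before it)."""
    gaps = []
    last_roll = None
    for ts, _level, text in entries:
        if "Rolled WAL" in text:
            last_roll = ts
        elif "Error syncing" in text and last_roll is not None:
            gaps.append((ts, (ts - last_roll).total_seconds()))
    return gaps
```

Usage would be along the lines of `roll_to_sync_error_gaps(parse_entries(open(path)))`, where `path` is the RS log file. Consistently small gaps would suggest the failure is tied to the roll itself.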
Regards,
Pankaj
-----Original Message-----
From: Anoop John [mailto:[email protected]]
Sent: Wednesday, March 23, 2016 3:50 PM
To: [email protected]
Subject: Re: Region server getting aborted in every one or two days
Did any explicit close operation happen on the WAL file at the same time? Any log
rolling? Can you check the logs to find out? Maybe check the HDFS logs for
close calls on the WAL file?
-Anoop-
On Wed, Mar 23, 2016 at 12:10 PM, Pankaj kr <[email protected]> wrote:
> Hi,
>
> In our production environment, the RS is getting aborted every one or two days
> with the following exception.
>
> 2016-03-16 13:57:07,975 | FATAL | MemStoreFlusher.0 | ABORTING region
> server xyz-vm8,24502,1458034278600: Replay of WAL required. Forcing
> server shutdown |
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2055)
> org.apache.hadoop.hbase.DroppedSnapshotException: region:
> TB_WEBLOGIN_201603,060,1457916997964.06e204d3bc262b72820aa195fec23513.
> at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2423)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2128)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2090)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1983)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1909)
> at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:509)
> at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:470)
> at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:74)
> at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedChannelException
> at
> org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:635)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
> at
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
> ... 1 more
>
> I don't see any error info on the HDFS side at that point in time.
> Has anyone faced this issue?
>
> HBase version is 0.98.6.
>
> Regards,
> Pankaj