[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2017-11-09 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-17381:
---
Fix Version/s: 1.4.0

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>Assignee: Zheng Hu
> Fix For: 2.0.0, 1.4.0, 1.3.1, 1.2.5
>
> Attachments: HBASE-17381.patch, HBASE-17381.v1.patch, 
> HBASE-17381.v2.patch, HBASE-17381.v3.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2017-11-08 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-17381:
---
Fix Version/s: (was: 1.4.0)

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>Assignee: Zheng Hu
> Fix For: 2.0.0, 1.3.1, 1.2.5
>
> Attachments: HBASE-17381.patch, HBASE-17381.v1.patch, 
> HBASE-17381.v2.patch, HBASE-17381.v3.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2017-02-07 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17381:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 1.2.5
   1.3.1
   1.4.0
   2.0.0
   Status: Resolved  (was: Patch Available)

Committed to branch-1.2+.  This would require some substantial rework for 
branch-1.1.

Thanks for the fix [~openinx]!

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>Assignee: huzheng
> Fix For: 2.0.0, 1.4.0, 1.3.1, 1.2.5
>
> Attachments: HBASE-17381.patch, HBASE-17381.v1.patch, 
> HBASE-17381.v2.patch, HBASE-17381.v3.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2017-02-05 Thread huzheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huzheng updated HBASE-17381:

Attachment: HBASE-17381.v3.patch

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>Assignee: huzheng
> Attachments: HBASE-17381.patch, HBASE-17381.v1.patch, 
> HBASE-17381.v2.patch, HBASE-17381.v3.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2017-01-05 Thread huzheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huzheng updated HBASE-17381:

Assignee: huzheng
  Status: Patch Available  (was: Open)

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>Assignee: huzheng
> Attachments: HBASE-17381.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2017-01-05 Thread huzheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huzheng updated HBASE-17381:

Attachment: HBASE-17381.patch

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
> Attachments: HBASE-17381.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2017-01-05 Thread huzheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huzheng updated HBASE-17381:

Status: Open  (was: Patch Available)

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2017-01-05 Thread huzheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huzheng updated HBASE-17381:

Status: Patch Available  (was: Open)

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

2016-12-27 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-17381:
--
Component/s: Replication

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -
>
> Key: HBASE-17381
> URL: https://issues.apache.org/jira/browse/HBASE-17381
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Gary Helmling
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the 
> run() method (for example failure to allocate direct memory for the DFS 
> client), the exception will be logged by the UncaughtExceptionHandler, but 
> the thread will also die and the replication queue will back up indefinitely 
> until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it 
> can actually handle.  For those that it really can't, it seems better to 
> abort the regionserver rather than just allow replication to stop with 
> minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:96)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:113)
> at 
> org.apache.hadoop.crypto.CryptoOutputStream.(CryptoOutputStream.java:108)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)