[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions
[ https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-17381:
-----------------------------------
    Fix Version/s: 1.4.0

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -----------------------------------------------------------------
>
>                 Key: HBASE-17381
>                 URL: https://issues.apache.org/jira/browse/HBASE-17381
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Gary Helmling
>            Assignee: Zheng Hu
>             Fix For: 2.0.0, 1.4.0, 1.3.1, 1.2.5
>
>      Attachments: HBASE-17381.patch, HBASE-17381.v1.patch, HBASE-17381.v2.patch, HBASE-17381.v3.patch
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the run() method (for example, a failure to allocate direct memory for the DFS client), the exception will be logged by the UncaughtExceptionHandler, but the thread will also die, and the replication queue will back up indefinitely until the RegionServer is restarted.
>
> We should make sure the worker thread is resilient to all exceptions that it can actually handle. For those that it really can't, it seems better to abort the RegionServer rather than just allowing replication to stop with minimal signal.
>
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in ReplicationSourceWorkerThread, currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:693)
>   at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
>   at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:96)
>   at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:113)
>   at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:108)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>   at org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
>   at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
>   at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
>   at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
>   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
>   at java.io.DataInputStream.read(DataInputStream.java:100)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
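The fix the description calls for amounts to two rules in the worker loop: retry what is retryable, and abort the RegionServer on everything else. The sketch below is a hypothetical illustration of that strategy, not the attached patch; the class name, the shipEdits() stand-in, and the one-second retry delay are invented for the example, while Server.abort() and Server.isStopped() are real HBase interfaces.

{code:java}
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hbase.Server;

/**
 * Hypothetical sketch of the strategy described above -- not the attached
 * patch. Exceptions the worker can handle are logged and retried; anything
 * else aborts the RegionServer instead of silently killing the thread and
 * letting the replication queue back up.
 */
public class ResilientReplicationWorker extends Thread {

  private static final Log LOG = LogFactory.getLog(ResilientReplicationWorker.class);

  private final Server server;            // provides abort() and isStopped()
  private volatile boolean running = true;

  public ResilientReplicationWorker(Server server) {
    super("ReplicationSourceWorkerThread-sketch");
    this.server = server;
    setDaemon(true);
  }

  @Override
  public void run() {
    while (running && !server.isStopped()) {
      try {
        shipEdits();
      } catch (IOException e) {
        // A retryable failure (e.g. the WAL reader could not be opened):
        // log it and try again after a pause, instead of letting the thread die.
        LOG.warn("Retryable error in replication worker, retrying", e);
        sleepQuietly(1000);
      } catch (Throwable t) {
        // Anything else -- such as the OutOfMemoryError in the stack trace
        // above -- cannot be handled here. Aborting gives operators a loud
        // signal instead of an indefinitely backed-up replication queue.
        server.abort("Unexpected throwable in replication worker", t);
        return;
      }
    }
  }

  /** Stand-in for the real loop body: open the WAL, read a batch, ship it. */
  private void shipEdits() throws IOException {
    // ...
  }

  private static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
  }

  /** Asks the worker to exit its loop cleanly. */
  public void terminate() {
    running = false;
    interrupt();
  }
}
{code}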
[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions
Andrew Purtell updated HBASE-17381:
-----------------------------------
    Fix Version/s: (was: 1.4.0)
[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions
Gary Helmling updated HBASE-17381:
----------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 1.2.5
                   1.3.1
                   1.4.0
                   2.0.0
           Status: Resolved (was: Patch Available)

Committed to branch-1.2+. This would require some substantial rework for branch-1.1. Thanks for the fix [~openinx]!
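The description notes that a default UncaughtExceptionHandler only logs the throwable while the thread dies anyway. An alternative belt-and-braces guard, shown below, is to install a handler that aborts the server. This is purely an illustration under that premise, not necessarily how the committed patch works; the class and method names are invented, and only Server.abort() is a real HBase API.

{code:java}
import org.apache.hadoop.hbase.Server;

/**
 * Illustrative alternative guard (assumption: not taken from the committed
 * patch): instead of a handler that merely logs, install an
 * UncaughtExceptionHandler that aborts the RegionServer, so a dying worker
 * cannot leave replication stalled with minimal signal.
 */
public final class AbortOnUncaught {

  public static Thread startWorker(final Server server, Runnable task, String name) {
    Thread worker = new Thread(task, name);
    worker.setDaemon(true);
    worker.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
      @Override
      public void uncaughtException(Thread t, Throwable e) {
        // Turns a silent thread death into an operator-visible abort.
        server.abort("Replication worker " + t.getName() + " died unexpectedly", e);
      }
    });
    worker.start();
    return worker;
  }

  private AbortOnUncaught() {
  }
}
{code}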
[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions
huzheng updated HBASE-17381:
----------------------------
    Attachment: HBASE-17381.v3.patch
[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions
huzheng updated HBASE-17381:
----------------------------
    Assignee: huzheng
      Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions
huzheng updated HBASE-17381:
----------------------------
    Attachment: HBASE-17381.patch
[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions
huzheng updated HBASE-17381:
----------------------------
    Status: Open (was: Patch Available)
[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions
huzheng updated HBASE-17381:
----------------------------
    Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions
Gary Helmling updated HBASE-17381:
----------------------------------
    Component/s: Replication