Hi, thanks for the input. We use
https://www.hetzner.de/gb/hosting/produkte_rootserver/ex40
Each node has 2 HDDs.

2015-04-24 19:07 GMT+02:00 Esteban Gutierrez <este...@cloudera.com>:

>  Hi Serega,
>
>
> The iostat data shows a very sharp spike in await time (I know it is
> outside of the logs' time range) and utilization is high, but it is hard
> to tell whether the drives in the DNs are getting saturated continuously,
> since it looks like an averaged metric. Is this some kind of virtualized
> environment? Are you using a NAS for the data volumes? If you look at the
> logs in context, it seems there is a bad DataNode causing trouble in the
> HDFS pipeline.
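>
> To see per-device numbers rather than an averaged view, something like
> this (just a sketch; adjust the interval to your setup) prints extended
> stats every 5 seconds so saturation shows up as it happens:
>
>     # extended per-device stats every 5 seconds; the first report is the
>     # average since boot, so watch the ones after it
>     iostat -dxm 5
>
> Sustained %util near 100 with climbing await on the HDDs would point at
> the spindles as the bottleneck.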
>
> cheers,
> esteban.
>
>
>
> --
> Cloudera, Inc.
>
>
> On Thu, Apr 23, 2015 at 1:37 PM, Serega Sheypak <serega.shey...@gmail.com>
> wrote:
>
> > Hi, is there any input here? What should we monitor?
> >
> > 2015-04-22 20:55 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>:
> >
> > > Here are the disk stats. The sadness appeared at 12:30-13:00.
> > >
> > >
> >
> https://www.dropbox.com/s/lj4r8o10buv1n2o/Screenshot%202015-04-22%2020.48.18.png?dl=0
> > >
> > > 2015-04-22 20:41 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>:
> > >
> > >> Here is an image
> > >>
> > >> 2015-04-22 20:40 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>:
> > >>
> > >>> Here are the DataNode logs from 5.9.41.237 <http://5.9.41.237:50010/>;
> > >>> the RegionServer logs were also from 5.9.41.237
> > >>> <http://5.9.41.237:50010/>.
> > >>>
> > >>> REQUEST_SHORT_CIRCUIT_FDS, blockid: 1078130838, srvID:
> > >>> 659e6be2-8d98-458b-94bc-3bcdbb517508, success: true
> > >>> 2015-04-22 12:46:17,154 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> > >>> 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId:
> > >>> bd37825f7a445e2da6796940ebb754d6, slotIdx: 24, srvID:
> > >>> 659e6be2-8d98-458b-94bc-3bcdbb517508, success: true
> > >>> 2015-04-22 12:46:17,204 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> > >>> 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId:
> > >>> aea53fb897c383f3dec304ed618db0df, slotIdx: 4, srvID:
> > >>> 659e6be2-8d98-458b-94bc-3bcdbb517508, success: true
> > >>> 2015-04-22 12:46:17,219 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> > >>> 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId:
> > >>> 0178cc0245fcdc9e1dd75c5f8c6da1eb, slotIdx: 0, srvID:
> > >>> 659e6be2-8d98-458b-94bc-3bcdbb517508, success: true
> > >>> 2015-04-22 12:46:17,236 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> > >>> 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId:
> > >>> aea53fb897c383f3dec304ed618db0df, slotIdx: 102, srvID:
> > >>> 659e6be2-8d98-458b-94bc-3bcdbb517508, success: true
> > >>> 2015-04-22 12:46:17,573 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > >>> 5.9.41.237:50010, dest: /5.9.65.143:36281, bytes: 4608, op:
> HDFS_READ,
> > >>> cliID: DFSClient_hb_rs_domain06.myhost.ru
> > ,60020,1426776843636_-1084357126_33,
> > >>> offset: 28435456, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508,
> blockid:
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078147715_4407483,
> duration:
> > >>> 12396486
> > >>> 2015-04-22 12:46:17,596 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > >>> 5.9.41.237:50010, dest: /78.46.48.37:41539, bytes: 4608, op:
> > HDFS_READ,
> > >>> cliID: DFSClient_hb_rs_domain13.myhost.ru
> > ,60020,1429640630559_-531755738_33,
> > >>> offset: 56052736, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508,
> blockid:
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078106901_4366668,
> duration:
> > >>> 37455427
> > >>> 2015-04-22 12:46:17,821 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > >>> 5.9.41.237:50010, dest: /5.9.67.100:58718, bytes: 5120, op:
> HDFS_READ,
> > >>> cliID: DFSClient_hb_rs_domain03.myhost.ru
> > ,60020,1426877064271_-1246826933_33,
> > >>> offset: 77630464, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508,
> blockid:
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078119757_4379524,
> duration:
> > >>> 16386940
> > >>> 2015-04-22 12:46:18,769 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to send data:
> > >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > for
> > >>> channel to be ready for write. ch :
> > >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> > >>> remote=/78.46.48.37:40503]
> > >>> 2015-04-22 12:46:18,769 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > >>> 5.9.41.237:50010, dest: /78.46.48.37:40503, bytes: 393216, op:
> > >>> HDFS_READ, cliID: DFSClient_hb_rs_domain13.myhost.ru
> > ,60020,1429640630559_-531755738_33,
> > >>> offset: 3584, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508, blockid:
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078155386_4415154,
> duration:
> > >>> 480345113893
> > >>> 2015-04-22 12:46:18,769 WARN
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> > >>> DatanodeRegistration(5.9.41.237,
> > >>> datanodeUuid=659e6be2-8d98-458b-94bc-3bcdbb517508, infoPort=50075,
> > >>> ipcPort=50020,
> storageInfo=lv=-56;cid=cluster11;nsid=527111981;c=0):Got
> > >>> exception while serving
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078155386_4415154 to /
> > >>> 78.46.48.37:40503
> > >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > for
> > >>> channel to be ready for write. ch :
> > >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> > >>> remote=/78.46.48.37:40503]
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> > >>> at java.lang.Thread.run(Thread.java:745)
> > >>> 2015-04-22 12:46:18,770 ERROR
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> > domain02.myhost.ru:50010:DataXceiver
> > >>> error processing READ_BLOCK operation  src: /78.46.48.37:40503 dst:
> /
> > >>> 5.9.41.237:50010
> > >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > for
> > >>> channel to be ready for write. ch :
> > >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> > >>> remote=/78.46.48.37:40503]
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> > >>> at java.lang.Thread.run(Thread.java:745)
> > >>> 2015-04-22 12:46:18,897 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to send data:
> > >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > for
> > >>> channel to be ready for write. ch :
> > >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> > >>> remote=/5.9.74.13:34323]
> > >>> 2015-04-22 12:46:18,897 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > >>> 5.9.41.237:50010, dest: /5.9.74.13:34323, bytes: 393216, op:
> > HDFS_READ,
> > >>> cliID: DFSClient_hb_rs_domain09.myhost.ru
> > ,60020,1426772397450_1095315710_33,
> > >>> offset: 3584, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508, blockid:
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078159305_4419073,
> duration:
> > >>> 480265217202
> > >>> 2015-04-22 12:46:18,897 WARN
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> > >>> DatanodeRegistration(5.9.41.237,
> > >>> datanodeUuid=659e6be2-8d98-458b-94bc-3bcdbb517508, infoPort=50075,
> > >>> ipcPort=50020,
> storageInfo=lv=-56;cid=cluster11;nsid=527111981;c=0):Got
> > >>> exception while serving
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078159305_4419073 to /
> > >>> 5.9.74.13:34323
> > >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > for
> > >>> channel to be ready for write. ch :
> > >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> > >>> remote=/5.9.74.13:34323]
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> > >>> at java.lang.Thread.run(Thread.java:745)
> > >>> 2015-04-22 12:46:18,898 ERROR
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> > domain02.myhost.ru:50010:DataXceiver
> > >>> error processing READ_BLOCK operation  src: /5.9.74.13:34323 dst: /
> > >>> 5.9.41.237:50010
> > >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > for
> > >>> channel to be ready for write. ch :
> > >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> > >>> remote=/5.9.74.13:34323]
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> > >>> at java.lang.Thread.run(Thread.java:745)
> > >>> 2015-04-22 12:46:18,933 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to send data:
> > >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > for
> > >>> channel to be ready for write. ch :
> > >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> > >>> remote=/5.9.37.17:48803]
> > >>> 2015-04-22 12:46:18,933 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > >>> 5.9.41.237:50010, dest: /5.9.37.17:48803, bytes: 393216, op:
> > HDFS_READ,
> > >>> cliID: DFSClient_hb_rs_domain07.myhost.ru
> > ,60020,1426776403757_1987133929_33,
> > >>> offset: 3584, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508, blockid:
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078158534_4418302,
> duration:
> > >>> 480375396714
> > >>> 2015-04-22 12:46:18,933 WARN
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> > >>> DatanodeRegistration(5.9.41.237,
> > >>> datanodeUuid=659e6be2-8d98-458b-94bc-3bcdbb517508, infoPort=50075,
> > >>> ipcPort=50020,
> storageInfo=lv=-56;cid=cluster11;nsid=527111981;c=0):Got
> > >>> exception while serving
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078158534_4418302 to /
> > >>> 5.9.37.17:48803
> > >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > for
> > >>> channel to be ready for write. ch :
> > >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> > >>> remote=/5.9.37.17:48803]
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> > >>> at java.lang.Thread.run(Thread.java:745)
> > >>> 2015-04-22 12:46:18,934 ERROR
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> > domain02.myhost.ru:50010:DataXceiver
> > >>> error processing READ_BLOCK operation  src: /5.9.37.17:48803 dst: /
> > >>> 5.9.41.237:50010
> > >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > for
> > >>> channel to be ready for write. ch :
> > >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> > >>> remote=/5.9.37.17:48803]
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> > >>> at java.lang.Thread.run(Thread.java:745)
> > >>> 2015-04-22 12:46:20,478 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to send data:
> > >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > for
> > >>> channel to be ready for write. ch :
> > >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> > >>> remote=/5.9.77.105:36871]
> > >>> 2015-04-22 12:46:20,478 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > >>> 5.9.41.237:50010, dest: /5.9.77.105:36871, bytes: 327680, op:
> > >>> HDFS_READ, cliID: DFSClient_hb_rs_domain01.myhost.ru
> > ,60020,1427209265829_704706779_33,
> > >>> offset: 3584, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508, blockid:
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078159312_4419080,
> duration:
> > >>> 480489227857
> > >>> 2015-04-22 12:46:20,478 WARN
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> > >>> DatanodeRegistration(5.9.41.237,
> > >>> datanodeUuid=659e6be2-8d98-458b-94bc-3bcdbb517508, infoPort=50075,
> > >>> ipcPort=50020,
> storageInfo=lv=-56;cid=cluster11;nsid=527111981;c=0):Got
> > >>> exception while serving
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078159312_4419080 to /
> > >>> 5.9.77.105:36871
> > >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > for
> > >>> channel to be ready for write. ch :
> > >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> > >>> remote=/5.9.77.105:36871]
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> > >>> at java.lang.Thread.run(Thread.java:745)
> > >>> 2015-04-22 12:46:20,478 ERROR
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> > domain02.myhost.ru:50010:DataXceiver
> > >>> error processing READ_BLOCK operation  src: /5.9.77.105:36871 dst: /
> > >>> 5.9.41.237:50010
> > >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > for
> > >>> channel to be ready for write. ch :
> > >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> > >>> remote=/5.9.77.105:36871]
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> > >>> at
> > >>>
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> > >>> at
> > >>>
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> > >>> at java.lang.Thread.run(Thread.java:745)
> > >>> 2015-04-22 12:46:20,677 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > >>> 5.9.41.237:50010, dest: /5.9.65.143:36286, bytes: 5120, op:
> HDFS_READ,
> > >>> cliID: DFSClient_hb_rs_domain06.myhost.ru
> > ,60020,1426776843636_-1084357126_33,
> > >>> offset: 86653440, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508,
> blockid:
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078128115_4387882,
> duration:
> > >>> 17623602
> > >>> 2015-04-22 12:46:20,690 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > >>> 5.9.41.237:50010, dest: /78.46.48.38:57022, bytes: 5120, op:
> > HDFS_READ,
> > >>> cliID: DFSClient_hb_rs_domain14.myhost.ru
> > ,60020,1429640630771_156324173_33,
> > >>> offset: 103135232, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508,
> > blockid:
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078120513_4380280,
> duration:
> > >>> 24431169
> > >>> 2015-04-22 12:46:20,747 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > >>> 5.9.41.237:50010, dest: /5.9.40.136:34501, bytes: 5120, op:
> HDFS_READ,
> > >>> cliID: DFSClient_hb_rs_domain04.myhost.ru
> > ,60020,1426774142736_-1119788258_33,
> > >>> offset: 78469632, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508,
> blockid:
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078108223_4367990,
> duration:
> > >>> 76055489
> > >>> 2015-04-22 12:46:20,758 INFO
> > >>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
> > >>> BP-927943268-5.9.77.105-1414682145673:blk_1078160469_4420237 src: /
> > >>> 5.9.74.13:35440 dest: /5.9.41.237:50010
> > >>>
> > >>>
> > >>> 2015-04-22 20:30 GMT+02:00 Jean-Marc Spaggiari <jean-m...@spaggiari.org>:
> > >>>
> > >>>> Did you send the logs from one of those 3 servers?
> > >>>> [5.9.41.237:50010, 5.9.77.105:50010, 5.9.73.19:50010]
> > >>>>
> > >>>> Sounds like something is slowing everything down. Can you extract
> > >>>> the DN logs for the same time?
> > >>>>
> > >>>> Do you have any tool monitoring the disks and network latency over
> > >>>> time?
> > >>>>
> > >>>> If not, can you run iostat and try to reproduce the issue?
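> > >>>>
> > >>>> A minimal sketch (interval and host are placeholders, adjust as
> > >>>> needed) that captures disk and network latency while you reproduce:
> > >>>>
> > >>>>     # per-disk extended stats with timestamps, one sample every 5s
> > >>>>     iostat -dxt 5 > iostat.log &
> > >>>>     # rough network latency to a peer DataNode over the same window
> > >>>>     ping -i 5 <peer-datanode-host> > ping.log &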
> > >>>>
> > >>>> JM
> > >>>>
> > >>>> 2015-04-22 14:23 GMT-04:00 Serega Sheypak <serega.shey...@gmail.com>:
> > >>>>
> > >>>> > > major compaction runs daily.
> > >>>> >
> > >>>> > >What else do you see in the RS logs?
> > >>>> > No errors, only *Slow sync cost*.
> > >>>> >
> > >>>> > >How does iostat look?
> > >>>> > Please see the image; 12:00-12:30 is when reading/writing stopped.
> > >>>> > [image: inline image 1]
> > >>>> >
> > >>>> >
> > >>>> > >Can you share the logs around the time this occurs?
> > >>>> > 2015-04-22 12:53:09,996 WARN org.apache.hadoop.ipc.RpcServer:
> > >>>> > (responseTooSlow):
> > >>>> >
> > >>>>
> >
> {"processingtimems":20128,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","client":"
> > >>>> > 5.9.75.155:58344
> > >>>> >
> > >>>>
> >
> ","starttimems":1429692769868,"queuetimems":28034,"class":"HRegionServer","responsesize":8,"method":"Multi"}
> > >>>> > 2015-04-22 12:53:09,996 WARN org.apache.hadoop.ipc.RpcServer:
> > >>>> > (responseTooSlow):
> > >>>> >
> > >>>>
> >
> {"processingtimems":21434,"call":"Mutate(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MutateRequest)","client":"
> > >>>> > 46.4.0.110:60149
> > >>>> >
> > >>>>
> >
> ","starttimems":1429692768562,"queuetimems":44263,"class":"HRegionServer","responsesize":2,"method":"Mutate"}
> > >>>> > 2015-04-22 12:53:10,093 WARN org.apache.hadoop.ipc.RpcServer:
> > >>>> > (responseTooSlow):
> > >>>> >
> > >>>>
> >
> {"processingtimems":17997,"call":"Mutate(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MutateRequest)","client":"46.4.0.
> > >>>> >
> > >>>> > 2015-04-22 12:53:10,270 WARN org.apache.hadoop.ipc.RpcServer:
> > >>>> > (responseTooSlow):
> > >>>> >
> > >>>>
> >
> {"processingtimems":18175,"call":"Mutate(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MutateRequest)","client":"
> > >>>> > 144.76.218.107:48620
> > >>>> >
> > >>>>
> >
> ","starttimems":1429692772095,"queuetimems":49253,"class":"HRegionServer","responsesize":2,"method":"Mutate"}
> > >>>> > 2015-04-22 12:53:10,315 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 319
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:10,315 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 319
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:10,315 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 319
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:10,315 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 319
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:10,315 WARN org.apache.hadoop.ipc.RpcServer:
> > >>>> > (responseTooSlow):
> > >>>> > {"processingtimems":20231,"call":"Mutate(org.apache.hadoop.hbase.
> > >>>> >
> > >>>> > 2015-04-22 12:53:11,726 INFO
> > >>>> org.apache.hadoop.hbase.regionserver.HStore:
> > >>>> > Added
> > >>>> >
> > >>>>
> >
> hdfs://nameservice1/hbase/data/default/cross_id_to_rtb_user/d2e01ed1de47dd92edb5c1de37277d4a/c/a93405e51088418ba95241cf53456011,
> > >>>> > entries=926, sequenceid=43403085, filesize=258.5 K
> > >>>> > 2015-04-22 12:53:11,726 INFO
> > >>>> org.apache.hadoop.hbase.regionserver.HRegion:
> > >>>> > Finished memstore flush of ~3.8 M/3985504, currentsize=12.6
> K/12912
> > >>>> for
> > >>>> > region
> > >>>> >
> > >>>>
> >
> cross_id_to_rtb_user,aadf0929,1429256709528.d2e01ed1de47dd92edb5c1de37277d4a.
> > >>>> > in 857ms, sequenceid=43403085, compaction requested=true
> > >>>> > 2015-04-22 12:53:11,726 INFO
> > >>>> org.apache.hadoop.hbase.regionserver.HRegion:
> > >>>> > Starting compaction on c in region
> > >>>> >
> > >>>>
> >
> cross_id_to_rtb_user,aadf0929,1429256709528.d2e01ed1de47dd92edb5c1de37277d4a.
> > >>>> > 2015-04-22 12:53:11,726 INFO
> > >>>> org.apache.hadoop.hbase.regionserver.HStore:
> > >>>> > Starting compaction of 3 file(s) in c of
> > >>>> >
> > >>>>
> >
> cross_id_to_rtb_user,aadf0929,1429256709528.d2e01ed1de47dd92edb5c1de37277d4a.
> > >>>> > into
> > >>>> >
> > >>>>
> >
> tmpdir=hdfs://nameservice1/hbase/data/default/cross_id_to_rtb_user/d2e01ed1de47dd92edb5c1de37277d4a/.tmp,
> > >>>> > totalSize=8.0 M
> > >>>> > 2015-04-22 12:53:11,917 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 417
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:11,939 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 340
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:11,939 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 340
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:11,939 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 340
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:11,939 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 340
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:12,199 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 282
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:13,132 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> > 1193
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:13,132 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> > 1193
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:13,132 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> > 1193
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:13,132 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> > 1193
> > >>>> ms,
> > >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> > >>>> 5.9.73.19:50010]
> > >>>> > 2015-04-22 12:53:13,247 INFO
> > >>>> > org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed,
> > >>>> > sequenceid=41975630, memsize=3.6 M, hasBloomFilter=true, into tmp
> > file
> > >>>> >
> > >>>>
> >
> hdfs://nameservice1/hbase/data/default/rtb_user_to_cross_id/c54d51dd7f20377ea34f05824abbaf76/.tmp/3495a2b487fa448bb9e0b50050b11c3a
> > >>>> > 2015-04-22 12:53:13,301 INFO
> > >>>> org.apache.hadoop.hbase.regionserver.HStore:
> > >>>> > Added
> > >>>> >
> > >>>>
> >
> hdfs://nameservice1/hbase/data/default/rtb_user_to_cross_id/c54d51dd7f20377ea34f05824abbaf76/c/3495a2b487fa448bb9e0b50050b11c3a,
> > >>>> > entries=886, sequenceid=41975630, filesize=272.9 K
> > >>>> >
> > >>>> > 2015-04-22 19:54 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
> > >>>> >
> > >>>> >> Serega:
> > >>>> >> How often is major compaction run in your cluster?
> > >>>> >>
> > >>>> >> Have you configured offpeak compaction?
> > >>>> >> See related parameters in:
> > >>>> >> http://hbase.apache.org/book.html#compaction.parameters
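> > >>>> >>
> > >>>> >> For example (values are illustrative only, see the page above for
> > >>>> >> the defaults in your version), something like this in
> > >>>> >> hbase-site.xml confines the more aggressive compaction ratio to a
> > >>>> >> low-traffic window:
> > >>>> >>
> > >>>> >>   <!-- treat 01:00-05:00 as off-peak -->
> > >>>> >>   <property>
> > >>>> >>     <name>hbase.offpeak.start.hour</name>
> > >>>> >>     <value>1</value>
> > >>>> >>   </property>
> > >>>> >>   <property>
> > >>>> >>     <name>hbase.offpeak.end.hour</name>
> > >>>> >>     <value>5</value>
> > >>>> >>   </property>
> > >>>> >>   <property>
> > >>>> >>     <name>hbase.hstore.compaction.ratio.offpeak</name>
> > >>>> >>     <value>5.0</value>
> > >>>> >>   </property>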
> > >>>> >>
> > >>>> >> Cheers
> > >>>> >>
> > >>>> >> On Wed, Apr 22, 2015 at 10:39 AM, Serega Sheypak
> > >>>> >> <serega.shey...@gmail.com> wrote:
> > >>>> >>
> > >>>> >> > Hi, we have a 10-node cluster running HBase 0.98 (CDH 5.2.1).
> > >>>> >> > Sometimes HBase gets stuck.
> > >>>> >> > We have several apps constantly writing/reading data from it.
> > >>>> >> > Sometimes we see the apps' response time increase dramatically:
> > >>>> >> > an app spends seconds reading from/writing to HBase, while 99%
> > >>>> >> > of the time it takes 20 ms.
> > >>>> >> >
> > >>>> >> > I suppose that compactions/major compactions could be the root
> > >>>> >> > cause. I see that compactions start at the same time we have
> > >>>> >> > problems with the apps. Could that be it? That is, HBase can't
> > >>>> >> > write to the WAL because compactions consume all the I/O, so
> > >>>> >> > the apps stop writing data?
> > >>>> >> >
> > >>>> >>
> > >>>> >
> > >>>> >
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> >
>
