While looking into this problem, I found that I have large dncp_block_verification.log.curr and dncp_block_verification.log.prev files. They are 294G each on the node with the high IOWAIT, even though the cluster was almost idle, while the other nodes have 0 for dncp_block_verification.log.curr and <15G for dncp_block_verification.log.prev. So it looks like https://issues.apache.org/jira/browse/HDFS-6114
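A quick way to spot these files on every datanode is to walk the data directories and report the size of anything matching dncp_block_verification.log.*. A minimal sketch in Python (the default root below is a placeholder; pass the directories from your dfs.datanode.data.dir setting as arguments):

#!/usr/bin/env python3
# Sketch: report the size of the block scanner's verification logs
# (dncp_block_verification.log.curr / .prev) under each datanode data dir.
# ASSUMPTION: the default root below is a placeholder; pass the directories
# from your dfs.datanode.data.dir setting as arguments instead.
import fnmatch
import os
import sys

data_dirs = sys.argv[1:] or ["/hadoop/hdfs/data"]  # placeholder default

def human(size):
    """Render a byte count as a short human-readable string."""
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024.0:
            return "%.1f%s" % (size, unit)
        size /= 1024.0
    return "%.1fP" % size

for root in data_dirs:
    for dirpath, _dirs, files in os.walk(root):
        for name in fnmatch.filter(files, "dncp_block_verification.log.*"):
            path = os.path.join(dirpath, name)
            print("%10s  %s" % (human(os.path.getsize(path)), path))

The workaround people report on that JIRA is to stop the datanode, delete the oversized verification logs, and start it again; they are block scanner bookkeeping, not block data.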
Thanks.

> On 04 Sep 2015, at 11:56, Adrien Mogenet <[email protected]> wrote:
>
> What is your disk configuration? JBOD? If RAID, possibly a dysfunctional
> RAID controller, or a constantly-rebuilding array.
>
> Do you have any idea which files the read blocks belong to?
>
> On 4 September 2015 at 11:02, Akmal Abbasov <[email protected]> wrote:
> Hi Adrien,
> for the last 24 hours all RS have been up and running, and there were no
> region transitions.
> The overall cluster iowait has decreased, but 2 RS still have very high
> iowait while there is no load on the cluster.
> My hypothesis about the high number of HDFS_READ/HDFS_WRITE entries in
> the RS logs has not held up, since all RS have an almost identical number
> of HDFS_READ/HDFS_WRITE entries, while only 2 of them have high iowait.
> According to iotop, the process doing most of the IO is the datanode, and
> it is reading constantly.
> Why would the datanode need to read from disk constantly?
> Any ideas?
>
> Thanks.
>
>> On 03 Sep 2015, at 18:57, Adrien Mogenet <[email protected]> wrote:
>>
>> Is the uptime of the RS "normal"? No quick and global reboot that could
>> lead to a region-reallocation storm?
>>
>> On 3 September 2015 at 18:42, Akmal Abbasov <[email protected]> wrote:
>> Hi Adrien,
>> I've run hdfs fsck and hbase hbck; hdfs is healthy and hbase is
>> consistent.
>> I'm using the default replication factor, so it is 3.
>> There are some under-replicated blocks, though.
>> The HBase master (node 10.10.8.55) is reading constantly from the region
>> servers. Today alone it has sent >150,000 HDFS_READ requests to each
>> region server so far, while the hbase cluster is almost idle.
>> What could cause this kind of behaviour?
>>
>> p.s. each node in the cluster has 2 cores and 4 GB RAM, just in case.
>>
>> Thanks.
>>
>>> On 03 Sep 2015, at 17:46, Adrien Mogenet <[email protected]> wrote:
>>>
>>> Is your HDFS healthy (fsck /)?
>>>
>>> Same for hbase hbck?
>>>
>>> What's your replication level?
>>>
>>> Can you see constant network use as well?
>>>
>>> Anything that might be triggered by the hbase master? (something like a
>>> virtually dead RS due to a ZK race condition, etc.)
>>>
>>> Your balancer run from 3 weeks ago shouldn't have any effect if you ran
>>> a major compaction successfully yesterday.
>>>
>>> On 3 September 2015 at 16:32, Akmal Abbasov <[email protected]> wrote:
>>> I started the HDFS balancer, but stopped it immediately after learning
>>> that running it is not a good idea.
>>> That was around 3 weeks ago; is it possible that it influenced the
>>> cluster behaviour I'm seeing now?
>>> Thanks.
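[An aside on the iotop observation above: to put numbers on how much the datanode process alone reads, you can sample /proc/<pid>/io, which is where iotop gets its data. A minimal sketch, assuming Linux and root privileges (like iotop), with the pid taken from jps:]

#!/usr/bin/env python3
# Sketch: a crude iotop-style sample of one process's disk I/O, using the
# same /proc/<pid>/io counters iotop reads. Linux only; needs root (or the
# process owner), just like iotop.
import sys
import time

def io_counters(pid):
    """Parse /proc/<pid>/io into a dict of counter name -> value."""
    counters = {}
    with open("/proc/%d/io" % pid) as f:
        for line in f:
            key, _, value = line.partition(":")
            counters[key.strip()] = int(value)
    return counters

pid = int(sys.argv[1])              # e.g. the datanode pid from jps
interval = 10.0                     # sampling window in seconds

before = io_counters(pid)
time.sleep(interval)
after = io_counters(pid)

for counter in ("read_bytes", "write_bytes"):
    rate = (after[counter] - before[counter]) / interval / 1e6
    print("%-12s %8.2f MB/s" % (counter, rate))

[Running it once against the datanode pid and once against the region server pid shows which process is actually driving the iowait.]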
>>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <[email protected]> wrote:
>>>>
>>>> Hi Ted,
>>>> No, there is no short-circuit read configured.
>>>> The datanode logs on 10.10.8.55 are full of the following messages:
>>>>
>>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>>>
>>>> There are >100,000 of them just for today, and the situation on the
>>>> other region servers is similar.
>>>> Node 10.10.8.53 is the hbase master node, and the process on that port
>>>> is also the hbase master.
>>>> So if there is no load on the cluster, why is there so much IO
>>>> happening? Any thoughts?
>>>> Thanks.
>>>>
>>>>> On 02 Sep 2015, at 21:57, Ted Yu <[email protected]> wrote:
>>>>>
>>>>> I assume you have enabled short-circuit read.
>>>>>
>>>>> Can you capture region server stack trace(s) and pastebin them?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <[email protected]> wrote:
>>>>> Hi Ted,
>>>>> I've checked the time when the addresses were changed, and this
>>>>> strange behaviour started weeks before it.
>>>>>
>>>>> Yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>>>>> Any thoughts?
>>>>>
>>>>> Thanks
>>>>>
>>>>>> On 02 Sep 2015, at 18:45, Ted Yu <[email protected]> wrote:
>>>>>>
>>>>>> bq. change the ip addresses of the cluster nodes
>>>>>>
>>>>>> Did this happen recently? If high iowait was observed after the
>>>>>> change (you can look at the ganglia graph), there is a chance that
>>>>>> the change was related.
>>>>>>
>>>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>>>>
>>>>>> Cheers
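[An aside on those clienttrace entries: rather than counting them by hand, a short script can tally HDFS_READ operations and bytes per destination host, which makes it obvious which client is hammering the datanode. A sketch that follows the line format quoted above; the log path is an assumption, pass your own datanode log file as the argument:]

#!/usr/bin/env python3
# Sketch: tally HDFS_READ clienttrace lines per destination host, to see
# which client issues the reads and how many bytes it pulls.
# ASSUMPTION: the datanode log file is passed as the only argument; the
# regex follows the clienttrace line format quoted above.
import re
import sys
from collections import defaultdict

PATTERN = re.compile(
    r"clienttrace: src: \S+, dest: /([\d.]+):\d+, bytes: (\d+), op: (\w+)")

ops = defaultdict(int)
nbytes = defaultdict(int)

with open(sys.argv[1]) as log:
    for line in log:
        m = PATTERN.search(line)
        if m and m.group(3) == "HDFS_READ":
            dest = m.group(1)
            ops[dest] += 1
            nbytes[dest] += int(m.group(2))

for dest in sorted(ops, key=ops.get, reverse=True):
    print("%-15s %9d reads %15d bytes" % (dest, ops[dest], nbytes[dest]))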
>>>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <[email protected]> wrote:
>>>>>> Hi Ted,
>>>>>> sorry, forgot to mention:
>>>>>>
>>>>>>> release of hbase / hadoop you're using
>>>>>>
>>>>>> hbase-0.98.7-hadoop2, hadoop-2.5.1
>>>>>>
>>>>>>> were region servers doing compaction ?
>>>>>>
>>>>>> I ran major compactions manually earlier today, but it seems they
>>>>>> have already completed, judging by the compactionQueueSize.
>>>>>>
>>>>>>> have you checked region server logs ?
>>>>>>
>>>>>> The datanode log is full of this kind of message:
>>>>>>
>>>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>>>>
>>>>>> p.s. we had to change the ip addresses of the cluster nodes; is that
>>>>>> relevant?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>> On 02 Sep 2015, at 18:20, Ted Yu <[email protected]> wrote:
>>>>>>>
>>>>>>> Please provide some more information:
>>>>>>>
>>>>>>> release of hbase / hadoop you're using
>>>>>>> were region servers doing compaction ?
>>>>>>> have you checked region server logs ?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>> I'm seeing strange behaviour in an hbase cluster. It is almost idle,
>>>>>>> with <5 puts and gets.
>>>>>>> But the data in hdfs keeps increasing, and the region servers have
>>>>>>> very high iowait (>100 on a 2-core CPU).
>>>>>>> iotop shows that the datanode process is reading and writing all the
>>>>>>> time.
>>>>>>> Any suggestions?
>>>>>>>
>>>>>>> Thanks.

> --
> Adrien Mogenet
> Head of Backend/Infrastructure
> [email protected]
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris
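[P.S. One more sketch, on the compactionQueueSize check mentioned up-thread: the region server publishes its metrics over HTTP, so the queue length can be polled from the /jmx endpoint instead of eyeballing the web UI. Assumptions here: 60030 is the hbase-0.98 default region server info port, and the relevant metric is compactionQueueLength; both may differ in other releases.]

#!/usr/bin/env python3
# Sketch: print queue-related region server metrics from the /jmx endpoint.
# ASSUMPTIONS: info port 60030 (hbase-0.98 default) and the
# compactionQueueLength metric name; verify both against your release.
import json
import sys
from urllib.request import urlopen

host = sys.argv[1] if len(sys.argv) > 1 else "10.10.8.55"  # example RS host
url = ("http://%s:60030/jmx"
       "?qry=Hadoop:service=HBase,name=RegionServer,sub=Server" % host)

data = json.loads(urlopen(url).read().decode("utf-8"))
for bean in data.get("beans", []):
    for key, value in sorted(bean.items()):
        # e.g. compactionQueueLength, flushQueueLength
        if "queue" in key.lower():
            print("%s = %s" % (key, value))

[A non-empty compaction queue while the cluster is "idle" would point at background compactions as the source of the datanode reads.]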
