Hi Adrien, for the last 24 hours all RS have been up and running, and there have been no region transitions. The overall cluster iowait has decreased, but 2 RS still have very high iowait even though there is no load on the cluster. My assumption about the high number of HDFS_READ/HDFS_WRITE entries in the RS logs turned out to be wrong, since all RS have an almost identical number of HDFS_READ/HDFS_WRITE entries, while only 2 of them have high iowait. According to iotop, the process doing most of the IO is the datanode, and it is reading constantly. Why would the datanode need to read from disk constantly? Any ideas?
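For what it's worth, this is roughly how I counted the reads on one of the affected nodes; just a sketch, and the datanode log path is an assumption for our install:

  # Assumed datanode log location; adjust for your layout.
  LOG=/var/log/hadoop-hdfs/hadoop-hdfs-datanode-$(hostname).log

  # Total HDFS_READ clienttrace entries logged today.
  grep "^$(date +%Y-%m-%d)" "$LOG" | grep -c "op: HDFS_READ"

  # The same reads broken down by destination host, to see which client is pulling the data.
  grep "^$(date +%Y-%m-%d)" "$LOG" | grep "op: HDFS_READ" \
    | sed -n 's/.*dest: \/\([0-9.]*\):.*/\1/p' | sort | uniq -c | sort -rn

On the two high-iowait nodes the destination breakdown should at least show whether it is the master or another RS doing the pulling.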
Thanks.

> On 03 Sep 2015, at 18:57, Adrien Mogenet <[email protected]> wrote:
>
> Is the uptime of RS "normal"? No quick and global reboot that could lead into a region-reallocation storm?
>
> On 3 September 2015 at 18:42, Akmal Abbasov <[email protected]> wrote:
> Hi Adrien,
> I’ve tried to run hdfs fsck and hbase hbck; hdfs is healthy and hbase is consistent.
> I’m using the default replication factor, so it is 3. There are some under-replicated blocks.
> The HBase master (node 10.10.8.55) is reading constantly from the regionservers. Today alone it has sent >150,000 HDFS_READ requests to each regionserver so far, while the hbase cluster is almost idle.
> What could cause this kind of behaviour?
>
> p.s. each node in the cluster has 2 cores and 4 GB RAM, just in case.
>
> Thanks.
>
>> On 03 Sep 2015, at 17:46, Adrien Mogenet <[email protected]> wrote:
>>
>> Is your HDFS healthy (fsck /)?
>>
>> Same for hbase hbck?
>>
>> What's your replication level?
>>
>> Can you see constant network use as well?
>>
>> Anything that might be triggered by the hbase master? (something like a virtually dead RS, due to a ZK race condition, etc.)
>>
>> Your balancer run from 3 weeks ago shouldn't have any effect if you ran a major compaction successfully yesterday.
>>
>> On 3 September 2015 at 16:32, Akmal Abbasov <[email protected]> wrote:
>> I started the HDFS balancer, but then stopped it immediately after learning that it is not a good idea.
>> That was around 3 weeks ago; is it possible that it influenced the cluster behaviour I’m seeing now?
>> Thanks.
>>
>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <[email protected]> wrote:
>>>
>>> Hi Ted,
>>> No, there is no short-circuit read configured.
>>> The datanode logs on 10.10.8.55 are full of the following messages:
>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>> There are >100,000 of them just for today, and the situation is similar on the other regionservers.
>>> Node 10.10.8.53 is the hbase-master node, and the process on that port is also the hbase-master.
>>> So if there is no load on the cluster, why is there so much IO happening? Any thoughts?
>>> Thanks.
>>>
>>>> On 02 Sep 2015, at 21:57, Ted Yu <[email protected]> wrote:
>>>>
>>>> I assume you have enabled short-circuit read.
>>>>
>>>> Can you capture region server stack trace(s) and pastebin them?
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <[email protected]> wrote:
>>>> Hi Ted,
>>>> I’ve checked the time when the addresses were changed, and this strange behaviour started weeks before that.
>>>>
>>>> Yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>>>> Any thoughts?
>>>>
>>>> Thanks
>>>>
>>>>> On 02 Sep 2015, at 18:45, Ted Yu <[email protected]> wrote:
>>>>>
>>>>> bq. change the ip addresses of the cluster nodes
>>>>>
>>>>> Did this happen recently? If high iowait was observed after the change (you can look at the ganglia graph), there is a chance that the change was related.
>>>>>
>>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <[email protected]> wrote:
>>>>> Hi Ted,
>>>>> sorry, I forgot to mention:
>>>>>
>>>>>> release of hbase / hadoop you're using
>>>>>
>>>>> hbase-0.98.7-hadoop2, hadoop-2.5.1
>>>>>
>>>>>> were region servers doing compaction?
>>>>>
>>>>> I’ve run major compactions manually earlier today, but it seems they have already completed, judging by the compactionQueueSize.
>>>>>
>>>>>> have you checked region server logs?
>>>>>
>>>>> The datanode log is full of this kind of message:
>>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>>>
>>>>> p.s. we had to change the ip addresses of the cluster nodes; is that relevant?
>>>>>
>>>>> Thanks.
>>>>>
>>>>>> On 02 Sep 2015, at 18:20, Ted Yu <[email protected]> wrote:
>>>>>>
>>>>>> Please provide some more information:
>>>>>>
>>>>>> release of hbase / hadoop you're using
>>>>>> were region servers doing compaction?
>>>>>> have you checked region server logs?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <[email protected]> wrote:
>>>>>> Hi,
>>>>>> I’m seeing strange behaviour in an hbase cluster. It is almost idle, with only <5 puts and gets.
>>>>>> But the data in hdfs is increasing, and the region servers have very high iowait (>100 on a 2-core CPU).
>>>>>> iotop shows that the datanode process is reading and writing all the time.
>>>>>> Any suggestions?
>>>>>>
>>>>>> Thanks.
>
> --
>
> Adrien Mogenet
> Head of Backend/Infrastructure
> [email protected]
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris
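P.S. To work out what the datanode keeps re-reading, I will try to map a few of the block IDs from the clienttrace lines quoted above back to HDFS paths. A rough sketch, not verified on our release (the block ID below is just one taken from the earlier log excerpt):

  # Walk the fsck block listing and print the file that owns the block in question.
  # Note: fsck with -files -blocks is heavy; narrow the path (e.g. /hbase) if you can.
  hdfs fsck /hbase -files -blocks 2>/dev/null \
    | awk '/^\// {file=$1} /blk_1075349331/ {print file}'

If the path lands under a table/region directory, that should tell us which region (or which other HBase file, e.g. something under the archive directory) is being read over and over.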
