What are the configs used to tune the run frequency of the block scanner? Or what event is used to trigger it to run?
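For what it's worth, in Hadoop 2.x (the thread below is on hadoop-2.5.1) the DataNode block scanner is a continuously running background thread rather than something triggered by an event: it cycles through each volume's replicas, throttled so that every replica is verified roughly once per scan period. The period is set in hdfs-site.xml; a minimal sketch (the value shown is the conventional three-week default, not a recommendation — check your release's hdfs-default.xml):

```xml
<!-- hdfs-site.xml: DataNode block scanner tuning (Hadoop 2.x).
     Illustrative value, not a recommendation. -->
<property>
  <!-- Maximum scan period per block, in hours. 504 hours = 3 weeks.
       In 2.7.0+ (after the HDFS-7430 rewrite) a negative value
       disables the scanner and 0 falls back to the 504-hour default. -->
  <name>dfs.datanode.scan.period.hours</name>
  <value>504</value>
</property>
```

In releases from 2.7.0 onward there is also a separate throttle, `dfs.block.scanner.volume.bytes.per.second`; on 2.5.x the older DataBlockScanner (the one affected by HDFS-6114 below) does not expose that knob.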
Thanks.

> On 07 Sep 2015, at 15:17, Ted Yu <[email protected]> wrote:
>
> W.r.t. the upgrade, this thread may be of interest to you:
> http://search-hadoop.com/m/uOzYt48qItawnLv1
>
>> On Sep 7, 2015, at 5:15 AM, Akmal Abbasov <[email protected]> wrote:
>>
>> While looking into this problem, I found that I have large
>> dncp_block_verification.log.curr and dncp_block_verification.log.prev files.
>> They are 294G each on the node which has high iowait, even when the cluster
>> was almost idle, while the other nodes have 0 for
>> dncp_block_verification.log.curr and <15G for dncp_block_verification.log.prev.
>> So it looks like https://issues.apache.org/jira/browse/HDFS-6114
>>
>> Thanks.
>>
>>> On 04 Sep 2015, at 11:56, Adrien Mogenet <[email protected]> wrote:
>>>
>>> What is your disk configuration? JBOD? If RAID, possibly a dysfunctional
>>> RAID controller, or a constantly-rebuilding array.
>>>
>>> Do you have any idea which files the read blocks belong to?
>>>
>>> On 4 September 2015 at 11:02, Akmal Abbasov <[email protected]> wrote:
>>> Hi Adrien,
>>> for the last 24 hours all RS have been up and running, and there were no
>>> region transitions. The overall cluster iowait has decreased, but 2 RS
>>> still have very high iowait while there is no load on the cluster.
>>> My assumption about the high number of HDFS_READ/HDFS_WRITE entries in the
>>> RS logs has not held up, since all RS have an almost identical number of
>>> HDFS_READ/HDFS_WRITE entries, while only 2 of them have high iowait.
>>> According to iotop, the process doing most of the I/O is the datanode, and
>>> it is reading constantly.
>>> Why would the datanode need to read from disk constantly?
>>> Any ideas?
>>>
>>> Thanks.
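As a side note, a quick way to spot the oversized verification logs described above is to scan the DataNode data directories. A sketch (the directory you pass in should be your `dfs.datanode.data.dir` — the example path is hypothetical):

```shell
# Find oversized block-verification logs under a DataNode data dir.
# $1: directory to search (your dfs.datanode.data.dir; assumed path here)
# $2: size threshold in kilobytes
find_big_verification_logs() {
  find "$1" -name 'dncp_block_verification.log.*' -size +"$2"k
}

# Example (path is hypothetical):
# find_big_verification_logs /data/hdfs/dn 1048576   # logs larger than 1 GiB
```

The usual HDFS-6114 workaround is to stop the affected datanode, remove the bloated log files, and restart it, since they are regenerated.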
>>>
>>>> On 03 Sep 2015, at 18:57, Adrien Mogenet <[email protected]> wrote:
>>>>
>>>> Is the uptime of the RS "normal"? No quick and global reboot that could
>>>> lead to a region-reallocation storm?
>>>>
>>>> On 3 September 2015 at 18:42, Akmal Abbasov <[email protected]> wrote:
>>>> Hi Adrien,
>>>> I've tried to run hdfs fsck and hbase hbck: hdfs is healthy, and hbase is
>>>> consistent. I'm using the default replication value, so it is 3.
>>>> There are some under-replicated blocks.
>>>> The HBase master (node 10.10.8.55) is reading constantly from the region
>>>> servers. Today alone it has sent >150,000 HDFS_READ requests to each
>>>> region server so far, while the hbase cluster is almost idle.
>>>> What could cause this kind of behaviour?
>>>>
>>>> p.s. each node in the cluster has 2 cores and 4 GB of RAM, just in case.
>>>>
>>>> Thanks.
>>>>
>>>>> On 03 Sep 2015, at 17:46, Adrien Mogenet <[email protected]> wrote:
>>>>>
>>>>> Is your HDFS healthy (fsck /)?
>>>>> Same for hbase hbck?
>>>>> What's your replication level?
>>>>> Can you see constant network use as well?
>>>>> Anything that might be triggered by the hbase master? (something like a
>>>>> virtually dead RS, due to a ZK race condition, etc.)
>>>>> Your balancer run from 3 weeks ago shouldn't have any effect if you ran
>>>>> a major compaction successfully yesterday.
>>>>>
>>>>> On 3 September 2015 at 16:32, Akmal Abbasov <[email protected]> wrote:
>>>>> I started the HDFS balancer, but then stopped it immediately after
>>>>> learning that it is not a good idea. That was around 3 weeks ago; is it
>>>>> possible that it influenced the cluster behaviour I'm seeing now?
>>>>> Thanks.
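For the fsck checks discussed above, one way to keep an eye on the counters over time is to save the fsck output and pull out the summary lines. A rough sketch (the line labels are assumed to match the summary format that `hdfs fsck` prints in this era of Hadoop):

```shell
# Extract the summary counters from saved `hdfs fsck /` output,
# e.g. produced with: hdfs fsck / > /tmp/fsck.out
# $1: path to the saved fsck output
fsck_summary() {
  grep -E 'Status:|Under-replicated blocks|Corrupt blocks|Missing replicas' "$1"
}
```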
>>>>>
>>>>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <[email protected]> wrote:
>>>>>>
>>>>>> Hi Ted,
>>>>>> No, there is no short-circuit read configured.
>>>>>> The logs of the datanode on 10.10.8.55 are full of the following messages:
>>>>>>
>>>>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>>>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>>>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>>>>>
>>>>>> There are >100,000 of them just for today. The situation on the other
>>>>>> region servers is similar.
>>>>>> Node 10.10.8.53 is the hbase-master node, and the process on that port
>>>>>> is also the hbase-master.
>>>>>> So if there is no load on the cluster, why is there so much I/O happening?
>>>>>> Any thoughts?
>>>>>> Thanks.
>>>>>>
>>>>>>> On 02 Sep 2015, at 21:57, Ted Yu <[email protected]> wrote:
>>>>>>>
>>>>>>> I assume you have enabled short-circuit read.
>>>>>>> Can you capture region server stack trace(s) and pastebin them?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <[email protected]> wrote:
>>>>>>> Hi Ted,
>>>>>>> I've checked the time when the addresses were changed, and this strange
>>>>>>> behaviour started weeks before that.
>>>>>>> Yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>>>>>>> Any thoughts?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>> On 02 Sep 2015, at 18:45, Ted Yu <[email protected]> wrote:
>>>>>>>>
>>>>>>>> bq. change the ip addresses of the cluster nodes
>>>>>>>>
>>>>>>>> Did this happen recently? If high iowait was observed after the change
>>>>>>>> (you can look at the ganglia graph), there is a chance that the change
>>>>>>>> was related.
>>>>>>>>
>>>>>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <[email protected]> wrote:
>>>>>>>> Hi Ted,
>>>>>>>> sorry, I forgot to mention:
>>>>>>>>
>>>>>>>>> release of hbase / hadoop you're using
>>>>>>>> hbase-0.98.7-hadoop2, hadoop-2.5.1
>>>>>>>>
>>>>>>>>> were region servers doing compaction ?
>>>>>>>> I ran major compactions manually earlier today, but it seems they have
>>>>>>>> already completed, judging by the compactionQueueSize.
>>>>>>>>
>>>>>>>>> have you checked region server logs ?
>>>>>>>> The logs of the datanode are full of this kind of message:
>>>>>>>>
>>>>>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>>>>>>
>>>>>>>> p.s. we had to change the ip addresses of the cluster nodes; is that
>>>>>>>> relevant?
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>> On 02 Sep 2015, at 18:20, Ted Yu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Please provide some more information:
>>>>>>>>>
>>>>>>>>> release of hbase / hadoop you're using
>>>>>>>>> were region servers doing compaction ?
>>>>>>>>> have you checked region server logs ?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>> I'm seeing strange behaviour in an hbase cluster. It is almost idle,
>>>>>>>>> with only <5 puts and gets.
>>>>>>>>> But the data in hdfs is increasing, and the region servers have very
>>>>>>>>> high iowait (>100, on a 2-core CPU).
>>>>>>>>> iotop shows that the datanode process is reading and writing all the
>>>>>>>>> time.
>>>>>>>>> Any suggestions?
>>>>>>>>>
>>>>>>>>> Thanks.
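To see where the clienttrace reads quoted in this thread are going, the datanode log can be summarized per destination; a rough sketch, assuming the field layout matches the log lines quoted above:

```shell
# Count HDFS_READ operations per destination host:port in a
# DataNode clienttrace log. $1: path to the log file.
count_hdfs_reads() {
  awk '/op: HDFS_READ/ {
         # Locate the "dest:" label and take the field after it,
         # stripping the trailing comma.
         for (i = 1; i <= NF; i++)
           if ($i == "dest:") { dest = $(i + 1); sub(/,$/, "", dest) }
         counts[dest]++
       }
       END { for (d in counts) print d, counts[d] }' "$1"
}
```

If one destination dominates (e.g. the hbase-master's address), that points at which client is driving the constant reads.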
>>>>> --
>>>>> Adrien Mogenet
>>>>> Head of Backend/Infrastructure
>>>>> [email protected]
>>>>> (+33)6.59.16.64.22
>>>>> http://www.contentsquare.com
>>>>> 50, avenue Montaigne - 75008 Paris
