On Mon, Aug 30, 2010 at 1:40 PM, Jean-Daniel Cryans <[email protected]> wrote:
> Vidhya,
>
> Thanks for keeping it up :)
>
> If you trace all the references to 0000003404803994 in the region
> server, what do you see? I think that most of the time is spent
> opening the region sequentially; it'd be nice to confirm.
>
> J-D
>
> On Mon, Aug 30, 2010 at 10:20 AM, Vidhyashankar Venkataraman
> <[email protected]> wrote:
>> Thank you for the responses, guys.. I will be looking at the problem later
>> this week, I guess. The initial eyeballing suggested that the RSs respond
>> to the master's open very quickly.. And I looked at our rrd stats (which
>> are similar to ganglia) and I couldn't see any obvious IO/CPU/network
>> bottlenecks, which means I haven't explored enough :) I will be coming
>> back to the problem soon and will let you know about my findings..
>>
>> I have been running scans and planning to run incremental loads on the db
>> and get some useful numbers.. Scans are blazing fast with just one
>> storefile per region: 75-95 MBps per node! (This is when locality between
>> the RS and its data is more or less preserved, i.e., after a major compact
>> has been issued.)
>>
>> I will update on my status soon..
>> Thank you again
>> Vidhya
>>
>> On 8/27/10 5:42 PM, "Jonathan Gray" <[email protected]> wrote:
>>
>> Vidhya,
>>
>> Could you post a snippet of an RS log during this time? You should be able
>> to see what's happening between when the OPEN message gets there and the
>> OPEN completes.
>>
>> Like Stack said, it's probably that it's single-threaded in the version
>> you're using, and with all the file opening, your NN and DNs are under
>> heavy load. Do you see io-wait or anything else jump up across the cluster
>> at this time? Do you have ganglia set up on this bad boy?
>>
>> JG
>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of
>>> Stack
>>> Sent: Friday, August 27, 2010 5:36 PM
>>> To: [email protected]
>>> Subject: Re: Initial region loads in hbase..
>>>
>>> In 0.20, open on a regionserver is single-threaded. Could that be it?
>>> The server has lots of regions to open and it's taking time? Is the
>>> meta table being beaten up? Could that be holding up region opens?
>>>
>>> Good luck V,
>>> St.Ack
>>>
>>>
>>> On Fri, Aug 27, 2010 at 5:01 PM, Vidhyashankar Venkataraman
>>> <[email protected]> wrote:
>>> > Hi guys,
>>> > A couple of days back, I posted a problem about regions taking too much
>>> time to load when I restart HBase.. I have a table that has around 80K
>>> regions on 650 nodes (!)..
>>> > I was checking the logs in the master, and I noticed that the time from
>>> assigning a region to a region server to the point when the master
>>> recognizes that the region is open on that server is around 20-30 minutes!
>>> > Apart from the master being the bottleneck here, can you guys let me
>>> know what the other possible causes of this might be?
>>> >
>>> > Thank you
>>> > Vidhya
>>> >
>>> > Below is an example for the region with start key 0000003404803994,
>>> where the assignment takes place at 22:59 while the confirmation that it
>>> got open only comes at 23:19...
>>> >
>>> > 2010-08-27 22:59:02,642 DEBUG org.apache.hadoop.hbase.master.BaseScanner: Current assignment of DocDB,0000003404803994,1282947439133.73c0f8fdb8ffbc20b9a239d325932ff1. is not valid; serverAddress=, startCode=0 unknown.
>>> > 2010-08-27 22:59:02,643 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 73c0f8fdb8ffbc20b9a239d325932ff1 in state = M2ZK_REGION_OFFLINE
>>> > 2010-08-27 22:59:02,645 DEBUG org.apache.hadoop.hbase.master.HMaster: Event NodeCreated with state SyncConnected with path /hbase/UNASSIGNED/73c0f8fdb8ffbc20b9a239d325932ff1
>>> > 2010-08-27 22:59:02,645 DEBUG org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event NodeCreated with path /hbase/UNASSIGNED/73c0f8fdb8ffbc20b9a239d325932ff1
>>> > 2010-08-27 22:59:02,645 DEBUG org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS: Got zkEvent NodeCreated state:SyncConnected path:/hbase/UNASSIGNED/73c0f8fdb8ffbc20b9a239d325932ff1
>>> > 2010-08-27 22:59:02,645 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <b3130520.yst.yahoo.net,b3130560.yst.yahoo.net,b3130600.yst.yahoo.net,b3130640.yst.yahoo.net,b3130680.yst.yahoo.net:/hbase,org.apache.hadoop.hbase.master.HMaster>Created ZNode /hbase/UNASSIGNED/73c0f8fdb8ffbc20b9a239d325932ff1 in ZooKeeper
>>> > 2010-08-27 22:59:02,646 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created/updated UNASSIGNED zNode DocDB,0000003404803994,1282947439133.73c0f8fdb8ffbc20b9a239d325932ff1. in state M2ZK_REGION_OFFLINE
>>> > 2010-08-27 22:59:02,646 DEBUG org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [ M2ZK_REGION_OFFLINE ] for region 73c0f8fdb8ffbc20b9a239d325932ff1
>>> > 2010-08-27 22:59:02,646 DEBUG org.apache.hadoop.hbase.master.HMaster: Event NodeChildrenChanged with state SyncConnected with path /hbase/UNASSIGNED
>>> > 2010-08-27 22:59:02,646 DEBUG org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event NodeChildrenChanged with path /hbase/UNASSIGNED
>>> > 2010-08-27 22:59:02,646 DEBUG org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS: Got zkEvent NodeChildrenChanged state:SyncConnected path:/hbase/UNASSIGNED
>>> > 2010-08-27 22:59:02,647 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for serverName=b3130020.yst.yahoo.net,60020,1282940735627, load=(requests=0, regions=76, usedHeap=80, maxHeap=7993): total nregions to assign=1, regions to give other servers than this=0, isMetaAssign=false
>>> > 2010-08-27 22:59:02,647 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning serverName=b3130020.yst.yahoo.net,60020,1282940735627, load=(requests=0, regions=76, usedHeap=80, maxHeap=7993) 1 regions
>>> > 2010-08-27 22:59:02,647 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region DocDB,0000003404803994,1282947439133.73c0f8fdb8ffbc20b9a239d325932ff1. to b3130020.yst.yahoo.net,60020,1282940735627
>>> > 2010-08-27 22:59:02,653 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: No need to update UNASSIGNED region 73c0f8fdb8ffbc20b9a239d325932ff1 as it already exists in state = M2ZK_REGION_OFFLINE
>>> > 2010-08-27 22:59:02,653 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created UNASSIGNED zNode DocDB,0000003404803994,1282947439133.73c0f8fdb8ffbc20b9a239d325932ff1. in state M2ZK_REGION_OFFLINE
>>> >
>>> > AND THEN,
>>> >
>>> > 2010-08-27 23:18:53,591 INFO org.apache.hadoop.hbase.master.RegionServerOperation: DocDB,0000003404803994,1282947439133.73c0f8fdb8ffbc20b9a239d325932ff1. open on b3130020.yst.yahoo.net,60020,1282940735627
>>> > 2010-08-27 23:18:53,591 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated row DocDB,0000003404803994,1282947439133.73c0f8fdb8ffbc20b9a239d325932ff1. in region .META.,,1 with startcode=1282940735627, server=b3130020.yst.yahoo.net:60020
>>> > 2010-08-27 23:18:53,677 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Deleting ZNode /hbase/UNASSIGNED/73c0f8fdb8ffbc20b9a239d325932ff1 in ZooKeeper as region is open...
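To act on J-D's suggestion above: tracing all references to the region through the region server log can be scripted rather than done by eye. Below is a minimal, hypothetical Java sketch (the log path and search key are placeholders); it prints every line mentioning the region, plus the first and last timestamps, which bracket how long the RS actually spent between receiving the OPEN and finishing it.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Print every log line that mentions a region (by start key or encoded
// name), plus the first/last timestamps of those mentions. The gap between
// them shows how long the region server spent on that region.
public class TraceRegionOpen {
    public static void main(String[] args) throws IOException {
        String logPath = args.length > 0 ? args[0] : "hbase-regionserver.log"; // placeholder path
        String needle  = args.length > 1 ? args[1] : "0000003404803994";       // region start key

        String first = null;
        String last = null;
        try (BufferedReader in = new BufferedReader(new FileReader(logPath))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains(needle)) {
                    System.out.println(line);
                    if (first == null) {
                        first = line;
                    }
                    last = line;
                }
            }
        }
        if (first != null) {
            // Log4j timestamps ("2010-08-27 22:59:02,642") are the first 23 chars.
            System.out.println("first mention: " + first.substring(0, Math.min(23, first.length())));
            System.out.println("last mention:  " + last.substring(0, Math.min(23, last.length())));
        }
    }
}

If the first and last mentions are also tens of minutes apart on the RS side, the time really is going into the single-threaded open path rather than into master bookkeeping.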
Just an FYI: hadoop-cacti-jtg (http://www.jointhegrid.com/hadoop/index.jsp) had excellent graph coverage for all the JMX attributes in HBase 0.20.3:

http://www.jointhegrid.com/hadoop-cacti-jtg-walk/hbase_region_blockcache.jsp
http://www.jointhegrid.com/hadoop-cacti-jtg-walk/hbase_master.jsp

Have there been any major changes/additions to HBase JMX recently?

Edward
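One way to answer that question empirically is to dump whatever MBeans and attributes a running master or region server currently exposes and diff that against what the cacti templates graph. Here is a minimal sketch using only the standard javax.management API; the host, port, and the "hadoop" MBean domain are assumptions, and the target JVM must be started with remote JMX enabled (the com.sun.management.jmxremote.* system properties).

import java.util.Set;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Connect to a remote JMX port and list every MBean in the (assumed)
// "hadoop" domain along with its attribute names and types.
public class DumpHBaseJmx {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port: point this at an HBase daemon with JMX enabled.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://regionserver-host:10102/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
            // Assumption: HBase metrics of this era register under the "hadoop" domain.
            Set<ObjectName> names = mbsc.queryNames(new ObjectName("hadoop:*"), null);
            for (ObjectName name : names) {
                System.out.println(name);
                for (MBeanAttributeInfo attr : mbsc.getMBeanInfo(name).getAttributes()) {
                    System.out.println("  " + attr.getName() + " : " + attr.getType());
                }
            }
        }
    }
}

Running this against an old and a new region server and diffing the output would show exactly which attributes have been added or renamed.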
