Added: https://issues.apache.org/jira/browse/HBASE-3438.
On Wed, Jan 12, 2011 at 11:40 AM, Wayne wrote:
> We are using 0.89.20100924, r1001068
>
> We see it during heavy write load (which is all the time), but
> yesterday we had read load as well as write load and saw both reads and
> writes stop for 10+ seconds. The region size is the biggest clue we have
> found from our tests: setting up a new cluster with a 1GB max region size
> and starting to load heavily, we see this a lot, for long time
> frames. Maybe the bigger file gets hung up more easily with a split? Your
> description below also fits, in that early on the load is not balanced, so it
> is easier to stop everything on one node.
> I will file a JIRA. I will also try to dig deeper into the logs during
> the pauses to find a node that might be stuck in a split.
>
> On Wed, Jan 12, 2011 at 11:17 AM, Stack wrote:
>
>> On Tue, Jan 11, 2011 at 2:34 PM, Wayne wrote:
>> > We have very frequent cluster wide pauses that stop all reads and writes
>> > for seconds.
>>
>> All reads and all writes?
>>
>> I've seen the pause too for writes. It's something I've always meant
>> to look into. Friso postulates one cause. Another that we've talked
>> of is a region taking a while to come back online after a split or a
>> rebalance for whatever reason. Client loading might be 'random',
>> spraying over lots of random regions, but the clients all get stuck
>> waiting on one particular region to come back online.
>>
>> I suppose reads could be blocked for the same reason if all are trying to
>> read from the offlined region.
>>
>> What version of hbase are you using? Splits should be faster in 0.90
>> now that the split daughters come up on the same regionserver.
>>
>> Sorry I don't have a better answer for you. Need to dig in.
>>
>> File a JIRA. If you want to help out some, stick some data up in it.
>> Some suggestions would be to enable logging of when we lookup region
>> locations in client and then note when requests go to zero. Can you
>> figure what region the clients are waiting on (if they are waiting on
>> any)? If you can pull out a particular one, try and elicit its
>> history at time of blockage. Is it being moved or mid-split? I
>> suppose it makes sense that bigger regions would make the situation
>> 'worse'. I can take a look at it too.
>>
>> St.Ack
>>
>> > We are constantly loading data to this cluster of 10 nodes.
>> > These pauses can happen as frequently as every minute but sometimes are
>> > not seen for 15+ minutes. Basically, watching the region server list with
>> > request counts is the only evidence of what is going on. All reads and
>> > writes totally stop, and if there is ever any activity it is on the node
>> > hosting the .META. table, with a request count of region count + 1. This
>> > problem seems to be worse with a larger region size. We tried a 1GB region
>> > size and saw this more than we saw actual activity (and stopped using a
>> > larger region size because of it). We went back to the default region size
>> > and it was better, but we had too many regions, so now we are up to 512M
>> > for a region size and we are seeing it more again.
>> >
>> > Does anyone know what this is? We have dug into all of the logs to find
>> > some sort of pause but are not able to find anything. Is this a WAL (HLog)
>> > roll? Is this a region split or compaction? Of course our biggest fear is
>> > a GC pause on the master, but we do not have Java logging turned on with
>> > the master to tell. What could possibly stop the entire cluster from
>> > working for seconds at a time very frequently?
>> >
>> > Thanks in advance for any ideas of what could be causing this.
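
For anyone following Stack's suggestion to note when requests go to zero, here is a rough sketch of how the pause windows could be pulled out of sampled request counts so they can be lined up against split and move events in the logs. It assumes the per-server request counts from the region server list have already been sampled into (timestamp, {server: count}) pairs; that sample format, the function name find_pauses, and the server names are hypothetical and not part of HBase.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical sample format: (unix time in seconds, {server name: request count})
Sample = Tuple[float, Dict[str, int]]


def find_pauses(
    samples: List[Sample],
    min_seconds: float = 1.0,
    ignore: FrozenSet[str] = frozenset(),
) -> List[Tuple[float, float]]:
    """Return (start, end) windows in which every non-ignored server
    reported zero requests for at least min_seconds.

    `ignore` can hold the server hosting .META., since the thread notes
    it may still show a small request count during a pause.
    """
    pauses = []
    start = None      # timestamp of the first idle sample in the current run
    last_idle = None  # timestamp of the most recent idle sample
    for ts, counts in samples:
        busy = any(n > 0 for server, n in counts.items() if server not in ignore)
        if not busy:
            if start is None:
                start = ts
            last_idle = ts
        else:
            if start is not None and last_idle - start >= min_seconds:
                pauses.append((start, last_idle))
            start = None
    # Close out a run of idle samples that reaches the end of the data.
    if start is not None and last_idle - start >= min_seconds:
        pauses.append((start, last_idle))
    return pauses
```

The windows returned are only as precise as the sampling interval, so this narrows down where to look in the region server logs rather than measuring the pause itself.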
