On Tue, Jan 11, 2011 at 2:34 PM, Wayne <[email protected]> wrote:
> We have very frequent cluster wide pauses that stop all reads and writes
> for seconds.

All reads and all writes? I've seen the pause too, for writes. It's
something I've always meant to look into.

Friso postulates one cause. Another that we've talked of is a region
taking a while to come back online after a split or a rebalance, for
whatever reason. Client loading might be 'random', spraying over lots of
regions, but the clients all get stuck waiting on one particular region to
come back online. I suppose reads could be blocked for the same reason if
all are trying to read from the offlined region.

What version of HBase are you using? Splits should be faster in 0.90 now
that the split daughters come up on the same regionserver.

Sorry I don't have a better answer for you. Need to dig in. File a JIRA.
If you want to help out some, stick some data up in it. Some suggestions
would be to enable logging of when we look up region locations in the
client and then note when requests go to zero. Can you figure out what
region the clients are waiting on (if they are waiting on any)? If you can
pull out a particular one, try to elicit its history at the time of
blockage. Is it being moved or mid-split? I suppose it makes sense that
bigger regions would make the situation 'worse'. I can take a look at it
too.

St.Ack

> We are constantly loading data to this cluster of 10 nodes.
> These pauses can happen as frequently as every minute but sometimes are not
> seen for 15+ minutes. Basically watching the Region server list with request
> counts is the only evidence of what is going on. All reads and writes
> totally stop and if there is ever any activity it is on the node hosting the
> .META. table with a request count of region count + 1. This problem seems to
> be worse with a larger region size. We tried a 1GB region size and saw this
> more than we saw actual activity (and stopped using a larger region size
> because of it). We went back to the default region size and it was better,
> but we had too many regions so now we are up to 512M for a region size and
> we are seeing it more again.
>
> Does anyone know what this is? We have dug into all of the logs to find some
> sort of pause but are not able to find anything. Is this a WAL (HLog) roll?
> Is this a region split or compaction? Of course our biggest fear is a GC
> pause on the master but we do not have java logging turned on with the
> master to tell. What could possibly stop the entire cluster from working for
> seconds at a time very frequently?
>
> Thanks in advance for any ideas of what could be causing this.
>
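
For the "note when requests go to zero" suggestion, something like the
following might help produce data for the JIRA. It is only a rough sketch,
not anything that ships with HBase: it assumes you have already sampled the
cluster-wide request count yourself (for example by polling the numbers
shown in the region server list) into (timestamp, count) pairs. The names
find_pauses, min_duration and idle_threshold are made up for illustration.
Setting idle_threshold above zero is one way to ignore the residual
requests seen on the node hosting .META. during a pause.

```python
from typing import Iterable, List, Tuple

# One sample: (unix timestamp in seconds, cluster-wide request count).
Sample = Tuple[float, int]


def find_pauses(
    samples: Iterable[Sample],
    min_duration: float = 1.0,
    idle_threshold: int = 0,
) -> List[Tuple[float, float]]:
    """Return (start, end) windows where every sample was at or below
    idle_threshold and the window lasted at least min_duration seconds.

    Samples are assumed to be sorted by timestamp.
    """
    pauses: List[Tuple[float, float]] = []
    start = None      # timestamp of the first idle sample in the current run
    last_idle = None  # timestamp of the most recent idle sample in the run

    for ts, count in samples:
        if count <= idle_threshold:
            if start is None:
                start = ts
            last_idle = ts
        else:
            # Activity resumed: close the current idle run, if any.
            if start is not None and last_idle - start >= min_duration:
                pauses.append((start, last_idle))
            start = None

    # Close a run that is still open when the samples end.
    if start is not None and last_idle - start >= min_duration:
        pauses.append((start, last_idle))
    return pauses


if __name__ == "__main__":
    demo = [(0, 120), (1, 0), (2, 0), (3, 0), (4, 95), (5, 0), (6, 80)]
    for begin, end in find_pauses(demo):
        print(f"requests at zero from t={begin} to t={end}")
```

The start and end times of each window can then be lined up against the
master and regionserver logs to see whether a region was being moved or
was mid-split at the time.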
