Hi Wayne,

> We are seeing some TCP Resets on all nodes at the same time, and sometimes
> quite a lot of them.

Have you checked this article from Andrei and Cosmin? They had a busy
firewall that caused a network blackout.

http://hstack.org/hbase-performance-testing/

It may not be your case, but it is worth checking just to be sure.

Thanks,

--
Tatsuya Kawano (Mr.)
Tokyo, Japan

On Jan 13, 2011, at 4:52 AM, Wayne <[email protected]> wrote:

> We are seeing some TCP Resets on all nodes at the same time, and sometimes
> quite a lot of them. We have yet to correlate the pauses to the TCP resets
> but I am starting to wonder if this is partly a network problem. Does
> Gigabit Ethernet break down on high volume nodes? Do high volume nodes use
> 10G or Infiniband?
>
>
> On Wed, Jan 12, 2011 at 1:52 PM, Stack <[email protected]> wrote:
>
>> Jon asks that you describe your loading in the issue. Would you mind
>> doing so? Ted, stick up in the issue the workload and configs you
>> are running if you don't mind. I'd like to try it over here.
>> Thanks lads,
>> St.Ack
>>
>>
>> On Wed, Jan 12, 2011 at 9:03 AM, Wayne <[email protected]> wrote:
>>> Added: https://issues.apache.org/jira/browse/HBASE-3438.
>>>
>>> On Wed, Jan 12, 2011 at 11:40 AM, Wayne <[email protected]> wrote:
>>>
>>>> We are using 0.89.20100924, r1001068
>>>>
>>>> We are seeing it during heavy write load (which is all the time), but
>>>> yesterday we had read load as well as write load and saw both reads and
>>>> writes stop for 10+ seconds. The region size is the biggest clue we have
>>>> found from our tests: when setting up a new cluster with a 1GB max region
>>>> size and starting to load heavily, we see this a lot, and for long time
>>>> frames. Maybe the bigger file gets hung up more easily with a split? Your
>>>> description below also fits, in that early on the load is not balanced, so
>>>> it is easier to stop everything on one node as the balance is not great
>>>> early on. I will file a JIRA. I will also try to dig deeper into the logs
>>>> during the pauses to find a node that might be stuck in a split.
>>>>
>>>>
>>>> On Wed, Jan 12, 2011 at 11:17 AM, Stack <[email protected]> wrote:
>>>>
>>>>> On Tue, Jan 11, 2011 at 2:34 PM, Wayne <[email protected]> wrote:
>>>>>> We have very frequent cluster wide pauses that stop all reads and
>>>>>> writes for seconds.
>>>>>
>>>>> All reads and all writes?
>>>>>
>>>>> I've seen the pause too for writes. It's something I've always meant
>>>>> to look into. Friso postulates one cause. Another that we've talked
>>>>> of is a region taking a while to come back online after a split or a
>>>>> rebalance for whatever reason. Client loading might be 'random',
>>>>> spraying over lots of random regions, but they all get stuck waiting on
>>>>> one particular region to come back online.
>>>>>
>>>>> I suppose reads could be blocked for the same reason if all are trying
>>>>> to read from the offlined region.
>>>>>
>>>>> What version of hbase are you using? Splits should be faster in 0.90
>>>>> now that the split daughters come up on the same region server.
>>>>>
>>>>> Sorry I don't have a better answer for you. Need to dig in.
>>>>>
>>>>> File a JIRA. If you want to help out some, stick some data up in it.
>>>>> Some suggestions would be to enable logging of when we look up region
>>>>> locations in the client and then note when requests go to zero. Can you
>>>>> figure out what region the clients are waiting on (if they are waiting
>>>>> on any)? If you can pull out a particular one, try and elicit its
>>>>> history at the time of blockage. Is it being moved or mid-split? I
>>>>> suppose it makes sense that bigger regions would make the situation
>>>>> 'worse'. I can take a look at it too.
>>>>>
>>>>> St.Ack
>>>>>
>>>>>
>>>>>> We are constantly loading data to this cluster of 10 nodes.
>>>>>> These pauses can happen as frequently as every minute but sometimes are
>>>>>> not seen for 15+ minutes. Basically, watching the Region server list
>>>>>> with request counts is the only evidence of what is going on. All reads
>>>>>> and writes totally stop, and if there is ever any activity it is on the
>>>>>> node hosting the .META. table with a request count of region count + 1.
>>>>>> This problem seems to be worse with a larger region size. We tried a 1GB
>>>>>> region size and saw this more than we saw actual activity (and stopped
>>>>>> using a larger region size because of it). We went back to the default
>>>>>> region size and it was better, but we had too many regions, so now we
>>>>>> are up to 512M for a region size and we are seeing it more again.
>>>>>>
>>>>>> Does anyone know what this is? We have dug into all of the logs to find
>>>>>> some sort of pause but are not able to find anything. Is this a WAL
>>>>>> (HLog) roll? Is this a region split or compaction? Of course our biggest
>>>>>> fear is a GC pause on the master, but we do not have Java GC logging
>>>>>> turned on for the master to tell. What could possibly stop the entire
>>>>>> cluster from working for seconds at a time very frequently?
>>>>>>
>>>>>> Thanks in advance for any ideas of what could be causing this.
