Heaps are 16G w/ hfile.block.cache.size = 0.5. Machines have 32G onboard and we used to run w/ 24G heaps, but we reduced them to lower GC times.
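For a rough sense of the headroom involved, here is a back-of-envelope sketch in Java using the figures quoted in this thread (16G RS heap, block cache fraction 0.5, 10G regions); the class name and the interpretation in the comments are illustrative, not a definitive diagnosis:

public class BatchHeadroom {
    public static void main(String[] args) {
        // Figures quoted in this thread; adjust for your cluster.
        long heapBytes = 16L << 30;                      // 16G region server heap
        double blockCacheFraction = 0.5;                 // hfile.block.cache.size
        long blockCacheBytes = (long) (heapBytes * blockCacheFraction);
        long regionBytes = 10737418240L;                 // hbase.hregion.max.filesize
        System.out.printf("block cache ~%dG, one region ~%dG%n",
            blockCacheBytes >> 30, regionBytes >> 30);
        // A batch of contiguous gets spanning a whole region can push on the
        // order of a full region's blocks through an ~8G cache in a single
        // request, which would line up with the eviction and GC pressure
        // described in this thread.
    }
}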
Not so sure about which regions were hot. And I don't want to repeat and take down my cluster again :)

What I know:
1) The request was about 4000 gets.
2) The 4000 keys are likely contiguous and therefore probably represent entire regions.
3) Once we batched the gets (so as not to kill the cluster), the result was >10G of data in the client. We blew the heap there :(
4) Our regions are 10G (hbase.hregion.max.filesize = 10737418240)

Distributing these keys via salting is not in our best interest, as we typically do these types of timeseries queries (though only recently at this scale).

I think I understand the failure mode; I guess I am just surprised that a greedy client can kill the cluster and that we are required to batch our gets in order to protect the cluster.

From: Nick Dimiduk [mailto:[email protected]]
Sent: Wednesday, February 25, 2015 9:40 AM
To: hbase-user
Cc: Ted Yu; Development
Subject: Re: Table.get(List<Get>) overwhelms several RSs

How large is your region server heap? What's your setting for hfile.block.cache.size? Can you identify which region is being burned up (i.e., is it META?)

It is possible for a hot region to act as a "death pill" that roams around the cluster. We see this with the meta region with poorly-behaved clients.

-n

On Wed, Feb 25, 2015 at 8:38 AM, Ted Tuttle <[email protected]> wrote:

Hard to say how balanced the table is. We have a mixed requirement where we want some locality for timeseries queries against "clusters" of information. However, the "clusters" in a table should be well distributed if the dataset is large enough.

The query in question killed 5 RSs, so I am inferring either:
1) the table was spread across these 5 RSs, or
2) the query moved around on the cluster as RSs failed.

Perhaps you could tell me if #2 is possible.

We are running v0.94.9.

From: Ted Yu [mailto:[email protected]]
Sent: Wednesday, February 25, 2015 7:24 AM
To: [email protected]
Cc: Development
Subject: Re: Table.get(List<Get>) overwhelms several RSs

Was the underlying table balanced (meaning its regions spread evenly across region servers)?

What release of HBase are you using?

Cheers

On Wed, Feb 25, 2015 at 7:08 AM, Ted Tuttle <[email protected]> wrote:

Hello-

In the last week we had multiple times where we lost 5 of 8 RSs in the space of a few minutes because of slow GCs. We traced this back to a client calling Table.get(List<Get> gets) with a collection containing ~4000 individual gets.

We've worked around this by limiting the number of Gets we send in a single call to Table.get(List<Get>).

Is there some configuration parameter that we are missing here?

Thanks,
Ted
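For reference, a minimal sketch of the kind of client-side batching described above, written against the 0.94-era HTable API. The ChunkedGets class, the process() helper, and the chunk size of 500 are all illustrative assumptions, not recommendations:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class ChunkedGets {

    // Illustrative cap only; tune to row size, RS heap, and client heap.
    private static final int MAX_GETS_PER_CALL = 500;

    public static void chunkedGet(HTable table, List<Get> gets) throws IOException {
        for (int i = 0; i < gets.size(); i += MAX_GETS_PER_CALL) {
            List<Get> chunk = gets.subList(i, Math.min(i + MAX_GETS_PER_CALL, gets.size()));
            Result[] chunkResults = table.get(chunk);
            // Consume each chunk and let it be garbage-collected before the next
            // call, rather than accumulating >10G of Results in the client.
            process(chunkResults);
        }
    }

    // Placeholder for whatever the application does with each batch of rows.
    private static void process(Result[] results) {
        // e.g. convert to domain objects, write out, aggregate, ...
    }
}

Processing each chunk as it arrives also avoids the >10G client-side accumulation mentioned above, though it does not by itself stop a run of contiguous keys from hammering a single region.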
