bq. The 4000 keys are likely contiguous and therefore probably represent entire regions
In that case you can convert the multi-gets to a Scan with a proper batch size and start/stop rows (a sketch follows below the quoted thread).

Cheers

On Wed, Feb 25, 2015 at 10:16 AM, Ted Tuttle <[email protected]> wrote:

> Heaps are 16G w/ hfile.block.cache.size = 0.5
>
> Machines have 32G onboard and we used to run w/ 24G heaps but reduced them to lower GC times.
>
> Not so sure about which regions were hot. And I don't want to repeat and take down my cluster again :)
>
> What I know:
>
> 1) The request was about 4000 gets.
> 2) The 4000 keys are likely contiguous and therefore probably represent entire regions.
> 3) Once we batched the gets (so as not to kill the cluster) the result was >10G of data in the client. We blew the heap there :(
> 4) Our regions are 10G (hbase.hregion.max.filesize = 10737418240).
>
> Distributing these keys via salting is not in our best interest, as we typically do these types of timeseries queries (though only recently at this scale).
>
> I think I understand the failure mode; I guess I am just surprised that a greedy client can kill the cluster and that we are required to batch our gets in order to protect the cluster.
>
> From: Nick Dimiduk [mailto:[email protected]]
> Sent: Wednesday, February 25, 2015 9:40 AM
> To: hbase-user
> Cc: Ted Yu; Development
> Subject: Re: Table.get(List<Get>) overwhelms several RSs
>
> How large is your region server heap? What's your setting for hfile.block.cache.size? Can you identify which region is being burned up (i.e., is it META?)
>
> It is possible for a hot region to act as a "death pill" that roams around the cluster. We see this with the meta region with poorly-behaved clients.
>
> -n
>
> On Wed, Feb 25, 2015 at 8:38 AM, Ted Tuttle <[email protected]> wrote:
>
> Hard to say how balanced the table is.
>
> We have a mixed requirement where we want some locality for timeseries queries against "clusters" of information. However, the "clusters" in a table should be well distributed if the dataset is large enough.
>
> The query in question killed 5 RSs, so I am inferring either:
>
> 1) the table was spread across these 5 RSs
> 2) the query moved around on the cluster as RSs failed
>
> Perhaps you could tell me if #2 is possible.
>
> We are running v0.94.9
>
> From: Ted Yu [mailto:[email protected]]
> Sent: Wednesday, February 25, 2015 7:24 AM
> To: [email protected]
> Cc: Development
> Subject: Re: Table.get(List<Get>) overwhelms several RSs
>
> Was the underlying table balanced (meaning its regions spread evenly across region servers)?
>
> What release of HBase are you using?
>
> Cheers
>
> On Wed, Feb 25, 2015 at 7:08 AM, Ted Tuttle <[email protected]> wrote:
>
> Hello-
>
> In the last week we had multiple times where we lost 5 of 8 RSs in the space of a few minutes because of slow GCs.
>
> We traced this back to a client calling Table.get(List<Get> gets) with a collection containing ~4000 individual gets.
>
> We've worked around this by limiting the number of Gets we send in a single call to Table.get(List<Get>).
>
> Is there some configuration parameter that we are missing here?
>
> Thanks,
> Ted
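A minimal sketch of the Scan conversion suggested above, written against the 0.94-style client API. The class name, table name, row keys, and the caching/batch values are illustrative, not taken from the thread:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ContiguousRangeScan {

  // Replace ~4000 contiguous Gets with one bounded Scan.
  // tableName, firstKey and lastKey are illustrative parameters,
  // not values from the thread.
  public static void scanRange(Configuration conf, String tableName,
                               byte[] firstKey, byte[] lastKey) throws IOException {
    // The stop row is exclusive, so append a 0x00 byte to keep lastKey in range.
    Scan scan = new Scan(firstKey, Bytes.add(lastKey, new byte[] { 0x00 }));
    scan.setCaching(100); // rows fetched per RPC; keeps each response small
    scan.setBatch(100);   // cells per Result; guards against very wide rows

    HTable table = new HTable(conf, tableName);
    try {
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result result : scanner) {
          // Process each row as it streams in instead of holding the whole
          // range in the client at once.
        }
      } finally {
        scanner.close();
      }
    } finally {
      table.close();
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    scanRange(conf, "my_table", Bytes.toBytes("row-0000"), Bytes.toBytes("row-3999"));
  }
}

setCaching bounds the rows returned per RPC and setBatch bounds the cells per Result, so neither the region servers nor the client has to materialize the whole contiguous range at once.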

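Likewise, a minimal sketch of the workaround described in the thread, capping the number of Gets sent in a single call to get(List<Get>). The chunk size and names are illustrative:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class ChunkedMultiGet {

  // Cap how many Gets go into a single multi-get call so no one request
  // pins too much data in region-server or client heap.
  // chunkSize is illustrative; tune it against your row sizes and heap.
  public static void getInChunks(Configuration conf, String tableName,
                                 List<Get> gets, int chunkSize) throws IOException {
    HTable table = new HTable(conf, tableName);
    try {
      for (int i = 0; i < gets.size(); i += chunkSize) {
        List<Get> chunk = new ArrayList<Get>(
            gets.subList(i, Math.min(i + chunkSize, gets.size())));
        Result[] results = table.get(chunk);
        for (Result result : results) {
          // Process (or discard) each chunk before issuing the next one.
        }
      }
    } finally {
      table.close();
    }
  }
}

Processing each chunk before issuing the next keeps client heap use bounded, at the cost of extra round trips.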