Heaps are 16G w/ hfile.block.cache.size = 0.5

Machines have 32G onboard and we used to run w/ 24G heaps but reduced them in 
order to lower GC times.

Not so sure about which regions were hot.  And I don't want to repeat and take 
down my cluster again :)

What I know:

1) The request was about 4000 gets.
2) The 4000 keys are likely contiguous and therefore probably represent entire 
regions
3) Once we batched the gets (so as not to kill the cluster) the result was >10G 
of data in client. We blew the heap there :(
4) Our regions are 10G (hbase.hregion.max.filesize  = 10737418240)
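
A rough sizing sketch based on points 1 and 3: if ~4000 gets returned roughly 
10G, each get averaged about 2.7 MB. Both figures are approximate (the 10G is 
a lower bound). The 256 MB per-batch budget and the names BatchSizing and 
getsPerBatch are hypothetical, chosen only to illustrate the arithmetic.

```java
public class BatchSizing {
    /**
     * Largest number of gets per batch whose expected payload fits in
     * budgetBytes, assuming every get returns the average payload.
     */
    public static int getsPerBatch(long totalBytes, int totalGets, long budgetBytes) {
        long avgBytesPerGet = Math.max(1, totalBytes / totalGets);
        return (int) Math.max(1, budgetBytes / avgBytesPerGet);
    }

    public static void main(String[] args) {
        long totalBytes = 10L * 1024 * 1024 * 1024; // ~10G seen in the client (lower bound)
        int totalGets = 4000;                       // size of the original request
        long budgetBytes = 256L * 1024 * 1024;      // hypothetical per-batch memory budget
        System.out.println(getsPerBatch(totalBytes, totalGets, budgetBytes)); // prints 100
    }
}
```

With these inputs the estimate is about 100 gets per batch. Real payloads are 
unlikely to be uniform, so this is only a starting point.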

Distributing these keys via salting is not in our best interest as we typically 
do these types of timeseries queries (though only recently at this scale).

I think I understand the failure mode. I guess I am just surprised that a 
greedy client can kill the cluster and that we are required to batch our gets 
in order to protect the cluster.
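
A minimal sketch of the client-side batching workaround, using only the JDK so 
it runs without a cluster. GetBatcher and partition are hypothetical names, the 
batch size of 100 is an assumption, and Integer stands in for Get. In real code 
each batch would be passed to Table.get(List<Get>) and its Results processed 
and released before the next batch is requested.

```java
import java.util.ArrayList;
import java.util.List;

public class GetBatcher {
    /** Splits items into consecutive sublists of at most batchSize elements. */
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stand-in for the ~4000 gets; real code would hold Get objects here.
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 4000; i++) {
            keys.add(i);
        }
        List<List<Integer>> batches = partition(keys, 100);
        // One Table.get(batch) call per entry would go here.
        System.out.println(batches.size() + " batches"); // prints "40 batches"
    }
}
```

This bounds both the work a single call asks of the region servers and the 
amount of result data held in the client at one time.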

From: Nick Dimiduk [mailto:[email protected]]
Sent: Wednesday, February 25, 2015 9:40 AM
To: hbase-user
Cc: Ted Yu; Development
Subject: Re: Table.get(List<Get>) overwhelms several RSs

How large is your region server heap? What's your setting for 
hfile.block.cache.size? Can you identify which region is being burned up (i.e., 
is it META?)

It is possible for a hot region to act as a "death pill" that roams around the 
cluster. We see this with the meta region with poorly-behaved clients.

-n

On Wed, Feb 25, 2015 at 8:38 AM, Ted Tuttle 
<[email protected]> wrote:
Hard to say how balanced the table is.

We have a mixed requirement where we want some locality for timeseries queries 
against "clusters" of information.  However, the "clusters" in a table 
should be well distributed if the dataset is large enough.

The query in question killed 5 RSs so I am inferring either:

1) the table was spread across these 5 RSs
2) the query moved around on the cluster as RSs failed

Perhaps you could tell me if #2 is possible.

We are running v0.94.9

From: Ted Yu [mailto:[email protected]]
Sent: Wednesday, February 25, 2015 7:24 AM
To: [email protected]
Cc: Development
Subject: Re: Table.get(List<Get>) overwhelms several RSs

Was the underlying table balanced (meaning its regions spread evenly across 
region servers) ?

What release of HBase are you using ?

Cheers
On Wed, Feb 25, 2015 at 7:08 AM, Ted Tuttle 
<[email protected]> wrote:
Hello-

In the last week we had multiple times where we lost 5 of 8 RSs in the space of 
a few minutes because of slow GCs.

We traced this back to a client calling Table.get(List<Get> gets) with a 
collection containing ~4000 individual gets.

We've worked around this by limiting the number of Gets we send in a single 
call to Table.get(List<Get>)

Is there some configuration parameter that we are missing here?
Thanks,
Ted
