>  We are also using a 5Gb region size to keep our region
> counts in the 100-200 range/node per Jonathan Grey's recommendation.

So there isn't a penalty incurred from increasing the max region size
from 256MB to 5GB?

On Fri, Feb 18, 2011 at 10:12 AM, Wayne <[email protected]> wrote:
> We have managed to get a little more than 1k QPS to date with 10 nodes.
> Honestly we are not quite convinced that disk i/o seeks are our biggest
> bottleneck. Of course they should be...but waiting for RPC connections,
> network latency, thrift etc. all play into the time to get reads. The std
> dev. of read time is way too high for our comfort, but os cache and other
> things make it hard to benchmark this. Our average read time has jumped to
> ~65ms which is a lot more than we expected (we expected 30-40ms). Our reads
> are not as small as originally thought, but the 65ms still seems high. We
> would love to have added a scanOpenWithStop single rpc read call (open, get
> all rows, close scanner in one rpc call). We are moving to 10k disks (5 data
> disks per node) so once we get some of these nodes in place we can see how
> things compare. I suspect things won't change much...which will confirm disk
> i/o is not our only bottleneck.  I will be happy to be wrong...
>
> Based on the performance we have seen we expect to need 40 nodes. With some
> fine tuning the 40 nodes can deliver 5k QPS which is what we need to run our
> application processes. We have a substantial write load as well, so our
> planning and numbers allow spare cycles for dealing with significant writes
> at the same time. We have cache set way low to just be used by .META. and
> all our tables have cache turned off. Memcached will be sitting on top to
> help client access. We are also using a 5Gb region size to keep our region
> counts in the 100-200 range/node per Jonathan Grey's recommendation.
>
> We have thought about going to infiniband or 10g, but with our smaller node
> sizes we don't think it will make much difference. The cost of infiniband
> for all 40 nodes buys us another 6 nodes...money better spent on scaling
> out.
>
>
> On Fri, Feb 18, 2011 at 2:15 AM, M. C. Srivas <[email protected]> wrote:
>
>> I was reading this thread with interest. Here's my $.02
>>
>> On Fri, Dec 17, 2010 at 12:29 PM, Wayne <[email protected]> wrote:
>>
>>> Sorry, I am sure my questions were far too broad to answer.
>>>
>>> Let me *try* to ask more specific questions. Assuming all data requests
>>> are
>>> cold (random reading pattern) and everything comes from the disks (no
>>> block
>>> cache), what level of concurrency can HDFS handle?
>>
>>
>> Cold cache, random reads ==> totally governed by seeks, so governed by # of
>> spindles per box.
>>
>> A SATA drive can do about 100 random seeks per sec, ie, 100 reads/second
>>
>>
>>
>>> Almost all of the load is
>>> controlled data processing, but we have to do a lot of work at night
>>> during
>>> the batch window so something in the 15-20,000 QPS range would meet
>>> current
>>> worse case requirements. How many nodes would be required to effectively
>>> return data against a 50TB aggregate data store?
>>
>>
>> Assuming 12 drives per node, and a cache hit rate of 0% (since its most
>> cold cache), you will see about 12 * 100  = 1200 reads per second per node.
>> If your cache hit rate goes up to 25%, then, your read rate is 1600
>> reads/sec/node
>> Thus, 10 machines can serve about 12k-16k reads/sec, cold cache.
>>
>> 50TB of data on 10 machine => 5 TB/node. That might be bit too much for
>> each region server (a RS can do about 700 regions comfortably, each of 1G).
>> If you push, you might get 2TB/regionserver, or, 25 machines for total. If
>> the data compresses 50%, then 12-13 nodes.
>>
>> So, for your scenario, its about 12-13 RS's, with 12 drives each, and you
>> will comfortably do 24k QPS cold cache.
>>
>> Does that help?
>>
>>
>>
>>> Disk I/O assumedly starts
>>> to break down at a certain point in terms of concurrent readers/node/disk.
>>> We have in our control how many total concurrent readers there are, so if
>>> we
>>> can get 10ms response time with 100 readers that might be better than
>>> 100ms
>>> responses from 1000 concurrent readers.
>>>
>>> Thanks.
>>>
>>>
>>> On Fri, Dec 17, 2010 at 2:46 PM, Jean-Daniel Cryans <[email protected]
>>> >wrote:
>>>
>>> > Hi Wayne,
>>> >
>>> > This question has such a large scope but is applicable to such a tiny
>>> > subset of workloads (eg yours) that fielding all the questions in
>>> > details would probably end up just wasting everyone's cycles. So first
>>> > I'd like to clear up some confusion.
>>> >
>>> > > We would like some help with cluster sizing estimates. We have 15TB of
>>> > > currently relational data we want to store in hbase.
>>> >
>>> > There's the 3x replication factor, but also you have to account that
>>> > each value is stored with it's row key, family name, qualifier and
>>> > timestamp. That could be a lot more data to store, but at the same
>>> > time you can use LZO compression to bring that down ~4x.
>>> >
>>> > > How many nodes, regions, etc. are we going to need?
>>> >
>>> > You don't really have the control over regions, they are created for
>>> > you as your data grows.
>>> >
>>> > > What will our read latency be for 30 vs. 100? Sure we can pack 20
>>> nodes
>>> > with 3TB
>>> > > of data each but will it take 1+s for every get?
>>> >
>>> > I'm not sure what kind of back-of-the-envelope calculations took you
>>> > to 1sec, but latency will be strictly determined by concurrency and
>>> > actual machine load. Even if you were able to pack 20TB in one onde
>>> > but using a tiny portion of it, you would still get sub 100ms
>>> > latencies. Or if you have only 10GB on that node but it's getting
>>> > hammered by 10000 clients, then you should expect much higher
>>> > latencies.
>>> >
>>> > > Will compaction run for 3 days?
>>> >
>>> > Which compactions? Major ones? If you don't insert new data in a
>>> > region, it won't be major compacted. Also if you have that much data,
>>> > I would set the time between major compactions to be bigger than 1
>>> > day. Heck, since you are doing time series, this means you'll never
>>> > delete anything right? So you might as well disable them.
>>> >
>>> > And now for the meaty part...
>>> >
>>> > The size of your dataset is only one part of the equation, the other
>>> > being traffic you would be pushing to the cluster which I think wasn't
>>> > covered at all in your email. Like I said previously, you can pack a
>>> > lot of data in a single node and can retrieve it really fast as long
>>> > as concurrency is low. Another thing is how random your reading
>>> > pattern is... can you even leverage the block cache at all? If yes,
>>> > then you can accept more concurrency, if not then hitting HDFS is a
>>> > lot slower (and it's still not very good at handling many clients).
>>> >
>>> > Unfortunately, even if you gave us exactly how many QPS you want to do
>>> > per second, we'd have a hard time recommending any number of nodes.
>>> >
>>> > What I would recommend then is to benchmark it. Try to grab 5-6
>>> > machines, load a subset of the data, generate traffic, see how it
>>> > behaves.
>>> >
>>> > Hope that helps,
>>> >
>>> > J-D
>>> >
>>> > On Fri, Dec 17, 2010 at 9:09 AM, Wayne <[email protected]> wrote:
>>> > > We would like some help with cluster sizing estimates. We have 15TB of
>>> > > currently relational data we want to store in hbase. Once that is
>>> > replicated
>>> > > to a factor of 3 and stored with secondary indexes etc. we assume will
>>> > have
>>> > > 50TB+ of data. The data is basically data warehouse style time series
>>> > data
>>> > > where much of it is cold, however want good read latency to get access
>>> to
>>> > > all of it. Not memory based latency but < 25ms latency for a small
>>> chunks
>>> > of
>>> > > data.
>>> > >
>>> > > How many nodes, regions, etc. are we going to need? Assuming a typical
>>> 6
>>> > > disk, 24GB ram, 16 core data node, how many of these do we need to
>>> > > sufficiently manage this volume of data? Obviously there are a million
>>> > "it
>>> > > depends", but the bigger drivers are how much data can a node handle?
>>> How
>>> > > long will compaction take? How many regions can a node handle and how
>>> big
>>> > > can those regions get? Can we really have 1.5TB of data on a single
>>> node
>>> > in
>>> > > 6,000 regions? What are the true drivers between more nodes vs. bigger
>>> > > nodes? Do we need 30 nodes to handle our 50GB of data or 100 nodes?
>>> What
>>> > > will our read latency be for 30 vs. 100? Sure we can pack 20 nodes
>>> with
>>> > 3TB
>>> > > of data each but will it take 1+s for every get? Will compaction run
>>> for
>>> > 3
>>> > > days? How much data is really "too much" on an hbase data node?
>>> > >
>>> > > Any help or advice would be greatly appreciated.
>>> > >
>>> > > Thanks
>>> > >
>>> > > Wayne
>>> > >
>>> >
>>>
>>
>>
>

Reply via email to