We have managed to get a little more than 1k QPS to date with 10 nodes.
Honestly we are not quite convinced that disk i/o seeks are our biggest
bottleneck. Of course they should be...but waiting for RPC connections,
network latency, thrift etc. all play into the time to get reads. The std
dev. of read time is way too high for our comfort, but os cache and other
things make it hard to benchmark this. Our average read time has jumped to
~65ms which is a lot more than we expected (we expected 30-40ms). Our reads
are not as small as originally thought, but the 65ms still seems high. We
would love to have added a scanOpenWithStop single rpc read call (open, get
all rows, close scanner in one rpc call). We are moving to 10k disks (5 data
disks per node) so once we get some of these nodes in place we can see how
things compare. I suspect things won't change much...which will confirm disk
i/o is not our only bottleneck.  I will be happy to be wrong...

Based on the performance we have seen we expect to need 40 nodes. With some
fine tuning the 40 nodes can deliver 5k QPS which is what we need to run our
application processes. We have a substantial write load as well, so our
planning and numbers allow spare cycles for dealing with significant writes
at the same time. We have cache set way low to just be used by .META. and
all our tables have cache turned off. Memcached will be sitting on top to
help client access. We are also using a 5Gb region size to keep our region
counts in the 100-200 range/node per Jonathan Grey's recommendation.

We have thought about going to infiniband or 10g, but with our smaller node
sizes we don't think it will make much difference. The cost of infiniband
for all 40 nodes buys us another 6 nodes...money better spent on scaling
out.


On Fri, Feb 18, 2011 at 2:15 AM, M. C. Srivas <[email protected]> wrote:

> I was reading this thread with interest. Here's my $.02
>
> On Fri, Dec 17, 2010 at 12:29 PM, Wayne <[email protected]> wrote:
>
>> Sorry, I am sure my questions were far too broad to answer.
>>
>> Let me *try* to ask more specific questions. Assuming all data requests
>> are
>> cold (random reading pattern) and everything comes from the disks (no
>> block
>> cache), what level of concurrency can HDFS handle?
>
>
> Cold cache, random reads ==> totally governed by seeks, so governed by # of
> spindles per box.
>
> A SATA drive can do about 100 random seeks per sec, ie, 100 reads/second
>
>
>
>> Almost all of the load is
>> controlled data processing, but we have to do a lot of work at night
>> during
>> the batch window so something in the 15-20,000 QPS range would meet
>> current
>> worse case requirements. How many nodes would be required to effectively
>> return data against a 50TB aggregate data store?
>
>
> Assuming 12 drives per node, and a cache hit rate of 0% (since its most
> cold cache), you will see about 12 * 100  = 1200 reads per second per node.
> If your cache hit rate goes up to 25%, then, your read rate is 1600
> reads/sec/node
> Thus, 10 machines can serve about 12k-16k reads/sec, cold cache.
>
> 50TB of data on 10 machine => 5 TB/node. That might be bit too much for
> each region server (a RS can do about 700 regions comfortably, each of 1G).
> If you push, you might get 2TB/regionserver, or, 25 machines for total. If
> the data compresses 50%, then 12-13 nodes.
>
> So, for your scenario, its about 12-13 RS's, with 12 drives each, and you
> will comfortably do 24k QPS cold cache.
>
> Does that help?
>
>
>
>> Disk I/O assumedly starts
>> to break down at a certain point in terms of concurrent readers/node/disk.
>> We have in our control how many total concurrent readers there are, so if
>> we
>> can get 10ms response time with 100 readers that might be better than
>> 100ms
>> responses from 1000 concurrent readers.
>>
>> Thanks.
>>
>>
>> On Fri, Dec 17, 2010 at 2:46 PM, Jean-Daniel Cryans <[email protected]
>> >wrote:
>>
>> > Hi Wayne,
>> >
>> > This question has such a large scope but is applicable to such a tiny
>> > subset of workloads (eg yours) that fielding all the questions in
>> > details would probably end up just wasting everyone's cycles. So first
>> > I'd like to clear up some confusion.
>> >
>> > > We would like some help with cluster sizing estimates. We have 15TB of
>> > > currently relational data we want to store in hbase.
>> >
>> > There's the 3x replication factor, but also you have to account that
>> > each value is stored with it's row key, family name, qualifier and
>> > timestamp. That could be a lot more data to store, but at the same
>> > time you can use LZO compression to bring that down ~4x.
>> >
>> > > How many nodes, regions, etc. are we going to need?
>> >
>> > You don't really have the control over regions, they are created for
>> > you as your data grows.
>> >
>> > > What will our read latency be for 30 vs. 100? Sure we can pack 20
>> nodes
>> > with 3TB
>> > > of data each but will it take 1+s for every get?
>> >
>> > I'm not sure what kind of back-of-the-envelope calculations took you
>> > to 1sec, but latency will be strictly determined by concurrency and
>> > actual machine load. Even if you were able to pack 20TB in one onde
>> > but using a tiny portion of it, you would still get sub 100ms
>> > latencies. Or if you have only 10GB on that node but it's getting
>> > hammered by 10000 clients, then you should expect much higher
>> > latencies.
>> >
>> > > Will compaction run for 3 days?
>> >
>> > Which compactions? Major ones? If you don't insert new data in a
>> > region, it won't be major compacted. Also if you have that much data,
>> > I would set the time between major compactions to be bigger than 1
>> > day. Heck, since you are doing time series, this means you'll never
>> > delete anything right? So you might as well disable them.
>> >
>> > And now for the meaty part...
>> >
>> > The size of your dataset is only one part of the equation, the other
>> > being traffic you would be pushing to the cluster which I think wasn't
>> > covered at all in your email. Like I said previously, you can pack a
>> > lot of data in a single node and can retrieve it really fast as long
>> > as concurrency is low. Another thing is how random your reading
>> > pattern is... can you even leverage the block cache at all? If yes,
>> > then you can accept more concurrency, if not then hitting HDFS is a
>> > lot slower (and it's still not very good at handling many clients).
>> >
>> > Unfortunately, even if you gave us exactly how many QPS you want to do
>> > per second, we'd have a hard time recommending any number of nodes.
>> >
>> > What I would recommend then is to benchmark it. Try to grab 5-6
>> > machines, load a subset of the data, generate traffic, see how it
>> > behaves.
>> >
>> > Hope that helps,
>> >
>> > J-D
>> >
>> > On Fri, Dec 17, 2010 at 9:09 AM, Wayne <[email protected]> wrote:
>> > > We would like some help with cluster sizing estimates. We have 15TB of
>> > > currently relational data we want to store in hbase. Once that is
>> > replicated
>> > > to a factor of 3 and stored with secondary indexes etc. we assume will
>> > have
>> > > 50TB+ of data. The data is basically data warehouse style time series
>> > data
>> > > where much of it is cold, however want good read latency to get access
>> to
>> > > all of it. Not memory based latency but < 25ms latency for a small
>> chunks
>> > of
>> > > data.
>> > >
>> > > How many nodes, regions, etc. are we going to need? Assuming a typical
>> 6
>> > > disk, 24GB ram, 16 core data node, how many of these do we need to
>> > > sufficiently manage this volume of data? Obviously there are a million
>> > "it
>> > > depends", but the bigger drivers are how much data can a node handle?
>> How
>> > > long will compaction take? How many regions can a node handle and how
>> big
>> > > can those regions get? Can we really have 1.5TB of data on a single
>> node
>> > in
>> > > 6,000 regions? What are the true drivers between more nodes vs. bigger
>> > > nodes? Do we need 30 nodes to handle our 50GB of data or 100 nodes?
>> What
>> > > will our read latency be for 30 vs. 100? Sure we can pack 20 nodes
>> with
>> > 3TB
>> > > of data each but will it take 1+s for every get? Will compaction run
>> for
>> > 3
>> > > days? How much data is really "too much" on an hbase data node?
>> > >
>> > > Any help or advice would be greatly appreciated.
>> > >
>> > > Thanks
>> > >
>> > > Wayne
>> > >
>> >
>>
>
>

Reply via email to