We would like some help with cluster sizing estimates. We have 15TB of
data, currently relational, that we want to store in HBase. Once it is
replicated to a factor of 3 and stored with secondary indexes etc., we
assume we will have 50TB+ of data. The data is basically data warehouse
style time series data where much of it is cold; however, we want good
read latency for access to all of it. Not memory-based latency, but < 25ms
latency for small chunks of data.

How many nodes, regions, etc. are we going to need? Assuming a typical 6
disk, 24GB RAM, 16 core data node, how many of these do we need to
sufficiently manage this volume of data? Obviously there are a million "it
depends", but the bigger drivers are: how much data can a node handle? How
long will compaction take? How many regions can a node handle, and how big
can those regions get? Can we really have 1.5TB of data on a single node in
6,000 regions? What are the true drivers between more nodes vs. bigger
nodes? Do we need 30 nodes to handle our 50TB of data, or 100 nodes? What
will our read latency be for 30 vs. 100? Sure, we can pack 20 nodes with 3TB
of data each, but will it take 1+s for every get? Will compaction run for 3
days? How much data is really "too much" on an HBase data node?
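For reference, here is the implied per-node math behind those questions (the per-node capacities are just the scenarios we are weighing, not recommendations):

```python
# Implied region size if one node carries 1.5TB in 6,000 regions.
node_data_tb = 1.5
regions_per_node = 6000
region_size_mb = node_data_tb * 1024 * 1024 / regions_per_node
print(region_size_mb)  # ~262 MB, in the neighborhood of the old 256MB default

# Node counts for 50TB total at the two densities we are considering.
total_tb = 50
for per_node_tb in (1.5, 3.0):
    print(per_node_tb, total_tb / per_node_tb)  # ~34 nodes vs ~17 nodes
```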

Any help or advice would be greatly appreciated.

Thanks

Wayne
