We would like some help with cluster sizing estimates. We have 15TB of currently relational data that we want to store in HBase. Once that is replicated at a factor of 3 and stored with secondary indexes etc., we assume we will have 50TB+ of data. The data is basically data-warehouse-style time series data, much of which is cold, but we want good read latency for access to all of it. Not memory-based latency, but < 25ms latency for small chunks of data.
How many nodes, regions, etc. are we going to need? Assuming a typical 6 disk, 24GB RAM, 16 core data node, how many of these do we need to sufficiently manage this volume of data? Obviously there are a million "it depends", but the bigger drivers are: how much data can a node handle? How long will compaction take? How many regions can a node handle, and how big can those regions get? Can we really have 1.5TB of data on a single node in 6,000 regions? What are the true drivers between more nodes vs. bigger nodes? Do we need 30 nodes to handle our 50TB of data, or 100 nodes? What will our read latency be for 30 vs. 100? Sure, we can pack 20 nodes with 3TB of data each, but will it take 1+s for every get? Will compaction run for 3 days? How much data is really "too much" on an HBase data node? Any help or advice would be greatly appreciated.

Thanks,
Wayne
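To make the question concrete, here is the back-of-envelope arithmetic we are trying to pin down, as a sketch. The region size and regions-per-node figures below are illustrative assumptions plugged in for the example, not recommendations; the whole point of the question is what these numbers should actually be:

```python
# Rough HBase sizing arithmetic. All parameters are assumptions
# to be replaced with real guidance:
#   total_tb        - post-replication data volume
#   region_size_gb  - assumed max region size
#   regions_per_node- assumed regions a region server can carry
def sizing(total_tb=50, region_size_gb=10, regions_per_node=200):
    total_gb = total_tb * 1024
    regions = total_gb / region_size_gb      # total regions in the cluster
    nodes = regions / regions_per_node       # region servers needed
    tb_per_node = total_tb / nodes           # data landing on each node
    return regions, nodes, tb_per_node

regions, nodes, tb_per_node = sizing()
print(f"{regions:.0f} regions, {nodes:.1f} nodes, {tb_per_node:.2f} TB/node")
```

Varying region_size_gb and regions_per_node in this sketch is exactly the 30-nodes-vs-100-nodes trade-off asked about above.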
