You can use Phoenix on top of HBase and use Phoenix's secondary indexes. Since you need 8 different kinds of queries, you may need to create 8 different indexes, and thus 8 index tables. But unlike Cassandra, you do not have to store all the column data redundantly in every table: you can use non-covered indexes instead, which store only a simple mapping from the indexed column to the rowkey. So there won't be an 8x space cost.
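For illustration, a minimal sketch of what that could look like in Phoenix SQL (the table and column names here are made up, not from the original thread):

    -- Hypothetical data table; the rowkey is (timebucket, id).
    CREATE TABLE input_data (
        timebucket BIGINT NOT NULL,
        id         BIGINT NOT NULL,
        col1       BIGINT,
        col2       BIGINT,
        -- ... col3 through col8 ...
        CONSTRAINT pk PRIMARY KEY (timebucket, id)
    );

    -- A global index with no INCLUDE clause is not covered: its index
    -- table stores only the indexed column plus the data table's rowkey,
    -- not all 8 columns.
    CREATE INDEX idx_col1 ON input_data (col1);
    CREATE INDEX idx_col2 ON input_data (col2);
    -- ... one index per query column, up to idx_col8 ...

Note that, depending on the Phoenix version, the optimizer may need an index hint (or a join back to the data table) before it will use a non-covered index for a select * query, so it is worth checking this against the Phoenix documentation for your version.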
For the 2nd question: in HBase, there is no node join/remove problem, since the storage layer (HDFS) and the computing layer are completely separated. You do not have to move data when an HBase node joins or leaves.

For the 3rd question, please refer to Josh Elser's reply below; it is just 'marketing trash'. HBase is a high-performance, low-latency ONLINE storage system, which has already been used massively in many real-time production systems.

Best Regards,
Allan Yang

Josh Elser <[email protected]> wrote on Tuesday, September 11, 2018 at 9:26 PM:
> Please be patient in getting a response to questions you post to this
> list, as we're all volunteers.
>
> On 9/8/18 2:16 AM, onmstester onmstester wrote:
> > Hi. Currently I'm using Apache Cassandra as the backend for my RESTful
> > application. With a cluster of 30 nodes (each having 12 cores, 64 GB
> > RAM, and 6 TB of disk, 50% of which is used), write and read
> > throughput is more than satisfactory for us. The input is a fixed set
> > of long and int columns which we need to query based on every column,
> > so having 8 columns there should be 8 tables, per the Cassandra
> > query-driven data modeling recommendation. The Cassandra keyspace
> > schema would be something like this:
> > Table 1 (timebucket, col1, ..., col8, primary key (timebucket, col1))
> > to handle: select * from input where timebucket = X and col1 = Y
> > ...
> > Table 8 (timebucket, col1, ..., col8, primary key (timebucket, col8))
> > So for each input row, there would be 8 inserts in Cassandra (not
> > counting RF), and with a TTL of 12 months, the production cluster
> > should keep about 2 petabytes of data. With the recommended node
> > density for a Cassandra cluster (2 TB per node), I would need a
> > cluster of more than 1000 nodes (which I cannot afford). So, long
> > story short: I'm looking for an alternative to Apache Cassandra for
> > this application. How would HBase solve these problems:
> >
> > 1. 8x data redundancy due to the needed queries
>
> HBase provides one intrinsic "index" over the data in your table, and
> that is the "rowkey". If you need to access the same data 8 different
> ways, you would need to come up with 8 indexes.
>
> FWIW, this is not what I commonly see. Usually there are 2 or 3 lookups
> that need to happen in the "fast path", not 8. Perhaps you need to take
> another look at your application needs?
>
> > 2. Nodes with large data density (30 TB of data on each node if No. 1
> > cannot be solved in HBase): how would HBase handle compaction and node
> > join/remove problems when there are only 5 * 6 TB 7200 RPM SATA disks
> > available on each node? How much empty space does HBase need for the
> > temporary files of compaction?
>
> HBase uses a distributed filesystem to ensure that data is available to
> be read by any RegionServer. Obviously, that filesystem needs to have
> sufficient capacity to write a new file which is approximately the sum
> of the file sizes being compacted.
>
> > 3. Also, I read in some documents (including DataStax's) that HBase
> > is more of an offline, data-lake backend that is better not used as a
> > web application backend which needs a response-time QoS of less than
> > a few seconds. Thanks in advance. Sent using Zoho Mail
>
> Sounds like marketing trash to me. The entire premise of HBase's
> architecture is:
>
> * Low-latency random writes/updates
> * Low-latency random reads
> * High-throughput writes via batch tools (e.g. bulk loading)
>
> IIRC, many early adopters of HBase were using it in the critical path
> for web applications.
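To make Josh's compaction headroom point concrete, a back-of-envelope example (the numbers are invented for illustration): compacting HFiles of 4 GB + 3 GB + 2 GB + 1 GB writes one new file of roughly 4 + 3 + 2 + 1 = 10 GB, and the old files are deleted only after the new one is complete, so HDFS needs about 10 GB of free space for that compaction alone. With several regions compacting concurrently, the required headroom is the sum across those concurrent compactions.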
