100 writes/updates per minute is a very low number, and HBase can of course sustain roughly 1.7 updates/sec (unless each update is gigabytes in size). 1000 concurrent users with minimal query latency is probably possible, but we do not have enough information: What is the SLA? What are the requests-per-second and latency requirements? How large is the typical result set?
You will definitely need to keep your hot data set in RAM. If you can afford to store the data twice, and ACID transactions are not a must-have feature, keep two rows per asset item:

rowkey1: asset_key + update_time
rowkey2: update_time + asset_key

This basically gives you two covering indexes, one by asset_key and one by update_time. Because you duplicate the data, you replace the many random lookups (as with a simple index) by one scan operation over the corresponding row keys. On each asset update, insert two rows into the table (you can keep them in the same table), and make sure you have enough RAM (cache) to keep everything in memory.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: [email protected]

________________________________________
From: Steven Wu [[email protected]]
Sent: Tuesday, December 10, 2013 3:35 PM
To: [email protected]
Subject: hbase schema design

Hi

I am very new to HBase, still self-learning and doing a POC for our current project. I have a question about row key design.

I have created a big table (called the asset table); it has more than 50M records. Each asset has a unique key (let's call it asset_key). This table receives continuous updates from an upstream system (around 100 updates per min). The clients would like to receive real-time updates from us.

In the current system, we have two indexed columns (asset_key, update_ts) on the asset DB table, so the clients can query the DB table based on update_ts for the latest updates. However, the DB has now become a bottleneck, so we are wondering how we could achieve the same function in HBase.

I don't want to use a scan filter on the column, as it will trigger a full table scan (correct me if I am wrong on this). The best thing I could think of is to have the timestamp built into the rowkey.
However, we still have a requirement that clients be able to query data by the unique asset_key.

The use case we have is that the system has to support more than 1000 concurrent users querying the latest updates from this table at the lowest possible latency. Also, clients would like to query data by the unique asset_key to retrieve records from our system.

Really appreciate your thoughts on this.

Regards,
Steven
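
For illustration, the two-rowkey layout suggested in the reply (asset_key + update_time and update_time + asset_key) could be sketched roughly as below. This is not HBase client code: SortedTable is a toy in-memory stand-in for a store that keeps keys in byte order, and the one-byte prefixes, the NUL separator, the millisecond timestamps and all function names are assumptions made for the example, not something stated in the thread.

```python
import bisect
import struct

# Hypothetical one-byte prefixes so both row types can share one table.
BY_ASSET = b"a"
BY_TIME = b"t"
BY_TIME_END = b"u"  # first key past every "t"-prefixed row


def encode_ts(ts_millis):
    # Big-endian 8 bytes: byte order matches numeric order for non-negative values.
    return struct.pack(">Q", ts_millis)


def asset_row_key(asset_key, ts_millis):
    # rowkey1: asset_key + update_time (assumes asset keys never contain a NUL byte).
    return BY_ASSET + asset_key.encode("utf-8") + b"\x00" + encode_ts(ts_millis)


def time_row_key(asset_key, ts_millis):
    # rowkey2: update_time + asset_key
    return BY_TIME + encode_ts(ts_millis) + asset_key.encode("utf-8")


class SortedTable:
    """Toy stand-in for a store that keeps rows sorted by key bytes."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, stop):
        # Range scan over [start, stop), in key order.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [self._values[k] for k in self._keys[lo:hi]]


def write_update(table, asset_key, ts_millis, payload):
    # The data is stored twice, once under each row key.
    table.put(asset_row_key(asset_key, ts_millis), payload)
    table.put(time_row_key(asset_key, ts_millis), payload)


def updates_since(table, ts_millis):
    # One range scan over the time-prefixed rows, oldest first.
    return table.scan(BY_TIME + encode_ts(ts_millis), BY_TIME_END)


def history_for_asset(table, asset_key):
    # One range scan over the rows of a single asset, oldest first.
    prefix = BY_ASSET + asset_key.encode("utf-8")
    return table.scan(prefix + b"\x00", prefix + b"\x01")
```

Both query patterns from the original question then become a single range scan: updates_since for "latest updates" and history_for_asset for lookups by asset_key. The two puts are independent writes, which is the reason the reply makes ACID transactions a precondition.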
