Bing,

HBase structures its storage using an approach known as "Log-Structured 
Merge-Trees" (LSM-trees), which you can learn more about here:

http://scholar.google.com/scholar?q=log+structured+merge+tree&hl=en&as_sdt=0&as_vis=1&oi=scholart

As well as in Lars George's great book, here:

http://shop.oreilly.com/product/0636920014348.do

It does all of these "frequent updates" just in memory, which is very fast; at 
the same time, it appends every edit to a simple append-only log on disk 
(known as the Write-Ahead Log, or WAL) in order to provide durability in the 
event of machine failure. It periodically writes the in-memory data to disk in 
big, immutable, ordered chunks called "store files", which is very efficient. 
Subsequent reads then "merge" the on-disk store file data with the current 
state in memory to get the full picture of any row. Over time, the many small 
store files get "compacted" into bigger files, so that individual reads don't 
have too many files to consult. Each "get" or "scan" operation reads only 
small blocks of the store files; when you ask for one record, HBase doesn't 
have to read gigabytes of data from disk, just a small block. As a result, 
random small reads and writes on a very big data set can be done efficiently.
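
To make the "merge" idea concrete, here's a toy sketch in plain Java. The 
class and names are made up for illustration; it's nothing like HBase's real 
code, which also has to merge multiple versions and delete markers:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy illustration of the LSM idea, not HBase's actual internals.
// It just shows "newest state wins" across the in-memory store and
// a stack of immutable snapshots.
class ToyLsmStore {
    // The mutable "memstore": recent edits, kept sorted by key.
    private final NavigableMap<String, String> memstore = new TreeMap<>();
    // Immutable "store files", newest first (real ones are HFiles on disk).
    private final Deque<NavigableMap<String, String>> storeFiles = new ArrayDeque<>();

    void put(String key, String value) {
        memstore.put(key, value);           // in-memory only: very fast
    }

    String get(String key) {
        String v = memstore.get(key);       // current in-memory state first
        if (v != null) return v;
        for (NavigableMap<String, String> sf : storeFiles) {
            v = sf.get(key);                // then older snapshots, newest first
            if (v != null) return v;
        }
        return null;
    }

    void flush() {                          // memstore -> new immutable snapshot
        storeFiles.addFirst(new TreeMap<>(memstore));
        memstore.clear();
    }
}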

Furthermore, it's fine to update the data store frequently. For any given 
record, you can make as many updates as you want to the in-memory structures; 
none of them are written to the store files on disk until the memory store is 
flushed. (Each edit does go to the WAL right away, but that's also efficient 
because the WAL is written in update order rather than record-key order, so 
it's a pure sequential append.) It all happens in memory, which is very fast, 
and it's safe because of the WAL. There are even some recent JIRAs that make 
that process more efficient; see, for example, HBASE-4241 
(https://issues.apache.org/jira/browse/HBASE-4241).
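
To give a feel for what "frequent updates" look like from the client side, 
here's a minimal sketch against the current HBase Java client API (in the 
0.92 era you'd use HTable instead; the table, family, and qualifier names 
here are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FrequentUpdates {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("mytable"))) {
            byte[] row = Bytes.toBytes("user-42");
            // Each put lands in the memstore and is appended to the WAL;
            // no store file on disk is rewritten until the memstore flushes.
            for (int i = 0; i < 1000; i++) {
                Put put = new Put(row);
                put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("count"),
                        Bytes.toBytes(i));
                table.put(put);
            }
        }
    }
}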

One way to think about it is that HBase is *precisely* a layer that adds these 
efficient random read/write capabilities on top of the Hadoop Distributed File 
System (HDFS), and takes care of doing so in a way that parallelizes nicely 
across a large cluster of machines, handles machine failures, and so on.
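
And the random-read half of that, continuing with the table handle from the 
sketch above (it additionally imports org.apache.hadoop.hbase.client.Get and 
org.apache.hadoop.hbase.client.Result):

// Fetching one row reads small blocks from the relevant store file(s)
// and merges them with the memstore; no gigabyte scans involved.
Get get = new Get(Bytes.toBytes("user-42"));
Result result = table.get(get);
byte[] latest = result.getValue(Bytes.toBytes("f"), Bytes.toBytes("count"));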

Ian

On Jan 29, 2012, at 10:16 PM, Bing Li wrote:

Dear Stack,

Thanks so much for your reply!

My understanding is that a large-scale distributed system prefers a
write-once-read-many pattern. Frequent updates would impose a heavy load to
keep data consistent, and performance would suffer. So HBase must not be
suitable for frequent updates, right?

Best regards,
Bing

On Mon, Jan 30, 2012 at 1:51 PM, Stack <[email protected]> wrote:

On Sun, Jan 29, 2012 at 12:02 PM, Bing Li <[email protected]> wrote:
Another question: is it appropriate to update data in HBase frequently?


This is 'normal', yes.
St.Ack

