I'm still kinda new to HBase, so please excuse me if I'm wrong. I suspect the reason has to do with a different slide in their presentation, where they run a job every hour that combines all the cells from the previous hour into one cell.
OpenTSDB has quite a long row key. It contains the metric name, the timestamp, and numerous optional tags. If you wrote one data point per second for a metric, you would end up with 3600 columns per row. HBase stores the row key with every cell, so with a row key that long it takes quite a bit of space to store the same row key 3600 times. By combining an hour's worth of data into one cell, OpenTSDB claims they cut their storage by 4-8x. If they had stayed with their original 10-minute time slice, they would have to store their giant row key 6 times per hour instead of once. I'm going to guess this

On Aug 27, 2013 10:50 PM, "林煒清" <[email protected]> wrote:

> *Context*:
>
> Recently, I see openTSDB having their rows packed by period, thus end in
> ten to hundred columns per row. It claim that this design performs more
> efficient for row seeking.(on slide:Lessons learned from openTSDB)
>
> *My argument*:
>
> If *a block of HFile *is indexed by the start key of itself, which the key
> is made of {row, cf, cq} , then I think read time for the specific Key
> should be the same for all tall-or-wide table case, since the physical
> storage is sorted by key, not only by rowkey.
>
> So that under one column family the rowkey+column is a key as a whole,
> shift a part of the rowkey to the column is the same as shift a part of
> rowkey to the tail of the rowkey, vice versa.
>
> Follow this logic , under physical view the openTSDB did is just change key
> index by shifting a portion of timestamp bytes to position behind rowkey,
> that is column qualifier.
>
> *Question*:
>
> 1.When getting (get is a special scan, right?) a packed row worth of one
> hour, or scan over one hour range of rows, I don't see there could any
> performance improvement. So why openTSDB says packed row have better
> performance for row seeking?
>
> 2.Almost every doc & books all recommend tall table design and especially
> at book "HBase in Action", it says that ,among others, the consideration of
> reading performance is the reason why tall is adopting, though I still
> can't get it why?
>
> 3.Also that the KeyValues inside a block is searched by *linear* scan, and
> start key of blocks is by binary search , right?
>
> any hint is much appreciated.
>
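
Coming back to the storage point at the top of this reply, here is a rough Python sketch of the row-key arithmetic. It assumes HBase stores the full row key with every cell, and the 40-byte row key is a number I made up for illustration, not a measured OpenTSDB value. It only counts the bytes spent repeating the row key, so it ignores the values themselves, the other per-cell fields, and any compression.

```python
# Hypothetical row key length in bytes (made up, not an OpenTSDB figure).
ROW_KEY_BYTES = 40
# One data point per second, as in the example above.
POINTS_PER_HOUR = 3600


def row_key_bytes(cells_per_hour, key_len=ROW_KEY_BYTES):
    """Bytes spent on row keys in one hour, assuming one key copy per cell."""
    return cells_per_hour * key_len


# One cell per data point, nothing combined yet.
uncompacted = row_key_bytes(POINTS_PER_HOUR)
# Six 10-minute rows per hour, each combined into a single cell.
ten_minute = row_key_bytes(6)
# One hourly row combined into a single cell.
hourly = row_key_bytes(1)

print(uncompacted, ten_minute, hourly)  # prints: 144000 240 40
```

The exact savings depend on how big the values are relative to the key, so this won't reproduce the 4-8x figure by itself. It only shows why repeating a long key per cell adds up.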
