Hi there,

One thing to mention about the BigTable paper is that they reverse the hostname part of the URL so that scans work across subdomains:

www.subdomain1.cnn.com -> com.cnn.subdomain1.www
www.subdomain2.cnn.com -> com.cnn.subdomain2.www

If you don't reverse the URL, there isn't an easy scan (short of creating another table to act as an index) for all the URLs under a domain.

Regarding the good question below about use-cases, the RefGuide says in 6.3.2.3: "Keep them as short as is reasonable such that they can still be useful for required data access". Shorter rowkeys are usually a good thing, but shorter isn't better if it doesn't work for what you are trying to do. :-)

On 8/29/13 10:18 AM, "Shahab Yunus" <[email protected]> wrote:

>What advantage will you gain by compressing? Less space? But then it
>will add compression/decompression performance overhead. It's a trade-off,
>but an especially significant one, as space is cheap and redundancy is OK
>with such data stores.
>
>Having said that, more importantly, what are your read use-cases or access
>patterns? That should drive your decision about row key design.
>
>Regards,
>Shahab
>
>
>On Thu, Aug 29, 2013 at 5:21 AM, Wasim Karani
><[email protected]> wrote:
>
>> I am using HBase to store webtable content, similar to how Google uses
>> Bigtable.
>> For reference, see the Google Bigtable paper.
>> My question is about the RowKey and how we should form it.
>> Google stores the URL in reverse order, as you can see in the PDF
>> document ("com.cnn.www"), so that all the links associated with cnn.com
>> are managed in the same block of GFS, which is a lot easier to scan.
>> I can do the same thing as Google, but wouldn't it be cool if I used
>> some algorithm to compress the URL?
>>
>> For eg.
>>
>> RowKey                               | Google Bigtable                      | Algorithm output
>> www.cnn.com/index.php                | com.cnn.www/index.php                | 12as/435
>> www.cnn.com/news/business/index.html | com.cnn.www/news/business/index.html | 12as/2as/dcx/asd
>> www.cnn.com/news/sports/index.html   | com.cnn.www/news/sports/index.html   | 12as/2as/eds/scf
>>
>> The reason for doing this is that the rowkey will be shorter, as
>> recommended by the HBase schema design guide (topic 6.3.2.3, Rowkey Length).
>>
>> So what I need from you guys is to know whether I am correct here.
>> Also, if I am correct, which algorithm should I be using? I am using Python
>> over Thrift as a programming language, so code will be overwhelming for me...
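
P.S. Since Python was mentioned, here is a minimal sketch of the reversed-hostname rowkey idea discussed in this thread. It is only an illustration under some assumptions: the function names reversed_host_rowkey and prefix_scan are made up for this example, and a sorted in-memory list stands in for an HBase table, so no HBase or Thrift calls are shown. Query strings and ports are ignored.

```python
from urllib.parse import urlsplit


def reversed_host_rowkey(url):
    """Hypothetical helper: reverse the hostname labels, keep the path as-is."""
    # urlsplit only recognizes the host when a scheme or a leading "//" is
    # present, so prepend "//" for bare inputs such as "www.cnn.com/index.php".
    # (Simple heuristic; it assumes a bare input has no "//" in its path.)
    parts = urlsplit(url if "//" in url else "//" + url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path


def prefix_scan(sorted_keys, prefix):
    """Stand-in for a rowkey prefix scan over keys kept in sorted order."""
    return [key for key in sorted_keys if key.startswith(prefix)]


urls = [
    "www.subdomain1.cnn.com/index.html",
    "www.bbc.co.uk/news",
    "www.subdomain2.cnn.com/index.html",
    "www.cnn.com/index.php",
]

# Rowkeys are stored in sorted order, so every key under cnn.com ends up
# adjacent once the hostname is reversed.
rowkeys = sorted(reversed_host_rowkey(u) for u in urls)

# The trailing dot keeps the prefix from also matching a host like "cnnx.com".
# Note it therefore matches hosts under cnn.com (www.cnn.com, subdomains),
# but not a bare "cnn.com" host, whose key would start with "com.cnn/".
cnn_keys = prefix_scan(rowkeys, "com.cnn.")
print(cnn_keys)
```

With the unreversed hostnames, the same three cnn.com URLs would sort under "www.cnn...", "www.subdomain1..." and "www.subdomain2...", with other sites' "www." hosts interleaved between them, which is why a single prefix scan would not cover the domain.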
