I think that's one of the questions we're all starting to face. At my current client, a secondary index that is embedded into the underlying table structure has been extremely valuable. It really helps when you want to set up an m/r scanner() that filters on something that is not the row key, or on a sparse value. An example: my row key is a unique record value, but in batch processing I want to filter on a specific job_id. If I create a secondary index on job_id, I'll have far fewer rows to process than with a full scan with a filter on job_id.
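
To make the job_id case concrete, here is a rough sketch in plain Python. It is not the HBase API: the in-memory table, the "job_id|record key" index layout, and every function name are made up for illustration. It only assumes what HBase gives you anyway, namely rows sorted by key, and it counts how many rows each approach has to touch.

```python
from bisect import bisect_left

# Hypothetical stand-in for a table: row key -> columns.
# 1000 records spread evenly over 50 job ids (20 records per job).
records = {
    f"rec{i:04d}": {"job_id": f"job{i % 50:02d}", "payload": i}
    for i in range(1000)
}


def full_scan_with_filter(table, job_id):
    """Visit every row in key order and keep the ones whose job_id matches."""
    visited = 0
    hits = []
    for row_key in sorted(table):
        visited += 1
        if table[row_key]["job_id"] == job_id:
            hits.append(row_key)
    return hits, visited


def build_index(table):
    """Build sorted index keys of the form 'job_id|record key' (assumed layout)."""
    return sorted(f"{cols['job_id']}|{row_key}" for row_key, cols in table.items())


def index_scan(index, job_id):
    """Seek to the first index key with the job_id prefix, stop when it ends."""
    prefix = f"{job_id}|"
    pos = bisect_left(index, prefix)
    visited = 0
    hits = []
    while pos < len(index) and index[pos].startswith(prefix):
        visited += 1
        hits.append(index[pos][len(prefix):])  # recover the record key
        pos += 1
    return hits, visited


if __name__ == "__main__":
    index = build_index(records)
    _, touched_full = full_scan_with_filter(records, "job07")
    _, touched_index = index_scan(index, "job07")
    print("rows touched by full scan:", touched_full)
    print("rows touched by index scan:", touched_index)
```

Both functions return the same record keys; the only difference is how many rows get read to find them. The cost, which the sketch ignores, is that every write to the main table also has to maintain the index.
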
At the same time, if one wants to allow ad hoc queries against arbitrary data, you'll want a different type of index. So the value depends on how you want to use Hadoop/HBase.

The problem is that you have to pull the code off GitHub and build it against your version of HBase (0.89 or 0.20.5).

> Date: Wed, 28 Jul 2010 17:05:46 -0400
> Subject: Extending RegionServer for Indexing or using the Client?
> From: [email protected]
> To: [email protected]
>
> Hi,
>
> I'm currently looking intensively into indexing for HBase. The Indexer
> maintained on http://github.com/hbase-trx/hbase-transactional-tableindexed
> extends the RegionServer and thus the client just defines the Index
> and then adds one Put with the record towards HBase. The rest is taken
> care of on the Region side by the derived class.
>
> What do you guys say - does it pay off to implement the Indexing (and
> maybe some other operations that result in a put) on the Region side
> or rather create the Indexer "outside" HBase and then push for
> instance two Puts() towards HBase? I saw that Lily is doing it the
> Client-Side way.
>
>
> Thx for the great support!
>
> /SJ
