I guess one reason is the amount of data traveling over the network. In your design, you have to query a secondary index table, read all the matching row keys of the original table, send them back to the client, and then issue a special scan that retrieves only the values for those row keys. In his example, he retrieved 2% of the data, which was around 10 million records, which is around 1 GB according to his key size (800 bytes). That's a lot of bytes being transferred and saturating your switches. In his design you read the row keys locally, so you can apply the rest of the filters on the region server and may eventually return just the 100 key values that match those extra filters. That saves tons of bandwidth and lots of RPC calls. In your example, using his design, you can treat each region as a mini table, each indexing its own data.
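To make the difference concrete, here is a minimal Python sketch of the two query paths. It is a toy in-memory model, not HBase code: `Region`, `build_index`, `global_index_query`, and `region_local_query` are hypothetical names I made up for illustration, and the byte count only measures the row-key bytes that would have to reach the client.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Toy stand-in for one region: row key -> {column: value}."""
    rows: dict
    # Region-local inverted index: (column, value) -> row keys in this region.
    index: dict = field(default_factory=dict)

    def build_index(self, column):
        """Index every row of this region that has the given column."""
        for key, cols in self.rows.items():
            if column in cols:
                self.index.setdefault((column, cols[column]), []).append(key)


def global_index_query(regions, column, value, extra_filter):
    """Client-driven path: every matching row key goes to the client first,
    and only then are the rows fetched and the extra filter applied."""
    matched_keys = [k for r in regions for k in r.index.get((column, value), [])]
    key_bytes_shipped = sum(len(k) for k in matched_keys)
    all_rows = {k: v for r in regions for k, v in r.rows.items()}
    result = [k for k in matched_keys if extra_filter(all_rows[k])]
    return result, key_bytes_shipped


def region_local_query(regions, column, value, extra_filter):
    """Region-local path: each region resolves its own index and applies the
    extra filter before anything leaves the region."""
    result = []
    for r in regions:
        local_keys = r.index.get((column, value), [])
        result.extend(k for k in local_keys if extra_filter(r.rows[k]))
    key_bytes_shipped = sum(len(k) for k in result)
    return result, key_bytes_shipped


# Made-up sample data: two regions, indexed on a hypothetical "color" column.
regions = [
    Region({"row-001": {"color": "red", "cost": 1500},
            "row-002": {"color": "red", "cost": 200},
            "row-003": {"color": "blue", "cost": 1800}}),
    Region({"row-004": {"color": "red", "cost": 50},
            "row-005": {"color": "red", "cost": 2500}}),
]
for region in regions:
    region.build_index("color")


def expensive(row):
    """Extra (non-indexed) filter applied after the index lookup."""
    return row["cost"] > 1000


print(global_index_query(regions, "color", "red", expensive))
print(region_local_query(regions, "color", "red", expensive))
```

Both paths return the same rows, but the region-local one only ships the keys that survive the extra filter, which is the bandwidth argument above in miniature.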
A secondary indexing solution that also supports joins, like any RDBMS does, has yet to be found, since it's fairly complex.

On Tuesday, January 8, 2013, Michael Segel wrote:

> So if you're using an inverted table / index why on earth are you doing it
> at the region level?
>
> I've tried to explain this to others over 6 months ago and its not really
> a good idea.
>
> You're over complicating this and you will end up creating performance
> bottlenecks when your secondary index is completely orthogonal to your row
> key.
>
> To give you an example...
>
> Suppose you're CCCIS and you have a large database of auto insurance
> claims that you've acquired over the years from your Pathways product.
>
> Your primary key would be a combination of the Insurance Company's ID and
> their internal claim ID for the individual claim.
> Your row would be all of the data associated to that claim.
>
> So now lets say you want to find the average cost to repair a front end
> collision of an S80 Volvo.
> The make and model of the car would be orthogonal to the initial key. This
> means that the result set containing insurance records for Front End
> collisions of S80 Volvos would be most likely evenly distributed across the
> cluster's regions.
>
> If you used a series of inverted tables, you would be able to use a series
> of get()s to get the result set from each index and then find their
> intersections. (Note that you could also put them in sort order so that the
> intersections would be fairly straight forward to find.)
>
> Doing this at the region level isn't so simple.
>
> So I have to again ask why go through and over complicate things?
>
> Just saying...
>
> On Jan 7, 2013, at 7:49 AM, Anoop Sam John <[email protected]> wrote:
>
> > Hi,
> > It is inverted index based on column(s) value(s)
> > It will be region wise indexing. Can work when some one knows the rowkey
> > range or NOT.
> >
> > -Anoop-
> > ________________________________________
> > From: Mohit Anchlia [[email protected]]
> > Sent: Monday, January 07, 2013 9:47 AM
> > To: [email protected]
> > Subject: Re: HBase - Secondary Index
> >
> > Hi Anoop,
> >
> > Am I correct in understanding that this indexing mechanism is only
> > applicable when you know the row key? It's not an inverted index truly
> > based on the column value.
> >
> > Mohit
> > On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <[email protected]> wrote:
> >
> >> Hi Adrien
> >> We are making the consistency btw the main table and
> >> index table and the roll back mentioned below etc using the CP hooks. The
> >> current hooks were not enough for those though.. I am in the process of
> >> trying to contribute those new hooks, core changes etc now... Once all are
> >> done I will be able to explain in details..
> >>
> >> -Anoop-
> >> ________________________________________
> >> From: Adrien Mogenet [[email protected]]
> >> Sent: Monday, January 07, 2013 2:00 AM
> >> To: [email protected]
> >> Subject: Re: HBase - Secondary Index
> >>
> >> Nice topic, perhaps one of the most important for 2013 :-)
> >> I still don't get how you're ensuring consistency between index table and
> >> main table, without an external component (such as bookkeeper/zookeeper).
> >> What's the exact write path in your situation when inserting data ?
> >> (WAL/RegionObserver, pre/post put/WALedit...)
> >>
> >> The underlying question is about how you're ensuring that WALEdit in Index
> >> and Main tables are perfectly sync'ed, and how you 're able to rollback in
> >> case of issue in both WAL ?
> >>
> >>
> >> On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min <[email protected]>
> >> wrote:
> >>
> >>>> Yes as you say when the no of rows to be returned is becoming more and
> >>>> more the latency will be becoming more. seeks within an HFile block is
> >>>> some what expensive op now.
> >>>> (Not much but still) The new encoding prefix
> >>>> trie will be a huge bonus here. There the seeks will be flying.. [Ted also
> >>>> presented this in the Hadoop China] Thanks to Matt... :) I am trying to
> >>>> measure the scan performance with this new encoding . Trying to back port
> >>>> a simple patch for 94 version just for testing... Yes when the no of
> >>>> results to be returned is more and more any index will become less
> >>>> performing as per my study :)
> >>>
> >>> yes, you are right, I guess it's just a drawback of any index approach.
> >>> Thanks for the explanation.
> >>>
> >>> Shengjie
> >>>
> >>> On 28 December 2012 04:14, Anoop Sam John <[email protected]> wrote:
> >>>
> >>>>> Do you have link to that presentation?
> >>>>
> >>>> http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
> >>>>
> >>>> -Anoop-
> >>>>
> >>>> ________________________________________
> >>>> From: Mohit Anchlia [[email protected]]
> >>>> Sent: Friday, December 28, 2012 9:12 AM
> >>>> To: [email protected]
> >>>
