+1 on Lars' comment. Either the client gets the rowkey from the secondary table and then gets the real data from the primary table, **OR** it sends the request to all the RS (or regions) hosting a region of the primary table.
Anoop is using the latter mechanism. Both mechanisms have their pros and cons. IMO, there is no outright winner.

~Anil Gupta

On Tue, Jan 8, 2013 at 4:30 PM, lars hofhansl <[email protected]> wrote:

> Different use cases.
>
> For global point queries you want exactly what you said below.
> For range scans across many rows you want Anoop's design. As usual, it
> depends.
>
> The tradeoff is bringing a lot of unnecessary data to the client vs having
> to contact each region (or at least each region server).
>
> -- Lars
>
> ________________________________
> From: Michael Segel <[email protected]>
> To: [email protected]
> Sent: Tuesday, January 8, 2013 6:33 AM
> Subject: Re: HBase - Secondary Index
>
> So if you're using an inverted table / index why on earth are you doing it
> at the region level?
>
> I've tried to explain this to others over 6 months ago and it's not really
> a good idea.
>
> You're over complicating this and you will end up creating performance
> bottlenecks when your secondary index is completely orthogonal to your row
> key.
>
> To give you an example...
>
> Suppose you're CCCIS and you have a large database of auto insurance
> claims that you've acquired over the years from your Pathways product.
>
> Your primary key would be a combination of the Insurance Company's ID and
> their internal claim ID for the individual claim.
> Your row would be all of the data associated to that claim.
>
> So now let's say you want to find the average cost to repair a front end
> collision of an S80 Volvo.
> The make and model of the car would be orthogonal to the initial key. This
> means that the result set containing insurance records for Front End
> collisions of S80 Volvos would be most likely evenly distributed across the
> cluster's regions.
>
> If you used a series of inverted tables, you would be able to use a series
> of get()s to get the result set from each index and then find their
> intersections.
> (Note that you could also put them in sort order so that the
> intersections would be fairly straight forward to find.
>
> Doing this at the region level isn't so simple.
>
> So I have to again ask why go through and over complicate things?
>
> Just saying...
>
> On Jan 7, 2013, at 7:49 AM, Anoop Sam John <[email protected]> wrote:
>
> > Hi,
> > It is inverted index based on column(s) value(s)
> > It will be region wise indexing. Can work when some one knows the rowkey
> > range or NOT.
> >
> > -Anoop-
> > ________________________________________
> > From: Mohit Anchlia [[email protected]]
> > Sent: Monday, January 07, 2013 9:47 AM
> > To: [email protected]
> > Subject: Re: HBase - Secondary Index
> >
> > Hi Anoop,
> >
> > Am I correct in understanding that this indexing mechanism is only
> > applicable when you know the row key? It's not an inverted index truly
> > based on the column value.
> >
> > Mohit
> > On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <[email protected]> wrote:
> >
> >> Hi Adrien
> >> We are making the consistency btw the main table and
> >> index table and the roll back mentioned below etc using the CP hooks. The
> >> current hooks were not enough for those though.. I am in the process of
> >> trying to contribute those new hooks, core changes etc now... Once all are
> >> done I will be able to explain in details..
> >>
> >> -Anoop-
> >> ________________________________________
> >> From: Adrien Mogenet [[email protected]]
> >> Sent: Monday, January 07, 2013 2:00 AM
> >> To: [email protected]
> >> Subject: Re: HBase - Secondary Index
> >>
> >> Nice topic, perhaps one of the most important for 2013 :-)
> >> I still don't get how you're ensuring consistency between index table and
> >> main table, without an external component (such as bookkeeper/zookeeper).
> >> What's the exact write path in your situation when inserting data ?
> >> (WAL/RegionObserver, pre/post put/WALedit...)
> >>
> >> The underlying question is about how you're ensuring that WALEdit in Index
> >> and Main tables are perfectly sync'ed, and how you're able to rollback in
> >> case of issue in both WAL ?
> >>
> >>
> >> On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min <[email protected]>
> >> wrote:
> >>
> >>>> Yes as you say when the no of rows to be returned is becoming more and
> >>>> more the latency will be becoming more. seeks within an HFile block is
> >>>> some what expensive op now. (Not much but still) The new encoding prefix
> >>>> trie will be a huge bonus here. There the seeks will be flying.. [Ted also
> >>>> presented this in the Hadoop China] Thanks to Matt... :) I am trying to
> >>>> measure the scan performance with this new encoding . Trying to back port
> >>>> a simple patch for 94 version just for testing... Yes when the no of
> >>>> results to be returned is more and more any index will become less
> >>>> performing as per my study :)
> >>>
> >>> yes, you are right, I guess it's just a drawback of any index approach.
> >>> Thanks for the explanation.
> >>>
> >>> Shengjie
> >>>
> >>> On 28 December 2012 04:14, Anoop Sam John <[email protected]> wrote:
> >>>
> >>>>> Do you have link to that presentation?
> >>>>
> >>>> http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
> >>>>
> >>>> -Anoop-
> >>>>
> >>>> ________________________________________
> >>>> From: Mohit Anchlia [[email protected]]
> >>>> Sent: Friday, December 28, 2012 9:12 AM
> >>>> To: [email protected]
> >>>> Subject: Re: HBase - Secondary Index
> >>>>
> >>>> On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Yes as you say when the no of rows to be returned is becoming more and
> >>>>> more the latency will be becoming more. seeks within an HFile block is
> >>>>> some what expensive op now. (Not much but still) The new encoding prefix
> >>>>> trie will be a huge bonus here. There the seeks will be flying..
> >>>>> [Ted also
> >>>>> presented this in the Hadoop China] Thanks to Matt... :) I am trying to
> >>>>> measure the scan performance with this new encoding . Trying to back port a
> >>>>> simple patch for 94 version just for testing... Yes when the no of
> >>>>> results to be returned is more and more any index will become less
> >>>>> performing as per my study :)
> >>>>>
> >>>>> Do you have link to that presentation?
> >>>>
> >>>>
> >>>>>> btw, quick question- in your presentation, the scale there is seconds or
> >>>>>> mill-seconds:)
> >>>>>
> >>>>> It is seconds. Dont consider the exact values. What is the % of increase
> >>>>> in latency is important :) Those were not high end machines.
> >>>>>
> >>>>> -Anoop-
> >>>>> ________________________________________
> >>>>> From: Shengjie Min [[email protected]]
> >>>>> Sent: Thursday, December 27, 2012 9:59 PM
> >>>>> To: [email protected]
> >>>>> Subject: Re: HBase - Secondary Index
> >>>>>
> >>>>>> Didnt follow u completely here. There wont be any get() happening.. As the
> >>>>>> exact rowkey in a region we get from the index table, we can seek to the
> >>>>>> exact position and return that row.
> >>>>>
> >>>>> Sorry, When I misused "get()" here, I meant seeking. Yes, if it's just
> >>>>> small number of rows returned, this works perfect. As you said you will get
> >>>>> the exact rowkey positions per region, and simply seek them. I was trying
> >>>>> to work out the case that when the number of result rows increases
> >>>>> massively. Like in Anil's case, he wants to do a scan query against the
> >>>>> 2ndary index(timestamp): "select all rows from timestamp1 to timestamp2"
> >>>>> given no customerId provided. During that time period, he might have a big
> >>>>> chunk of rows from different customerIds.
> >>>>> The index table returns a lot of
> >>>>> rowkey positions for different customerIds (I believe they are scattered in
> >>>>> different regions), then you end up seeking all different positions in
> >>>>> different regions and return all the rows needed. According to your
> >>>>> presentation page14 - Performance Test Results (Scan), without index, it's
> >>>>> a linear increase as result rows # increases. on the other hand, with
> >>>>> index, time spent climbs up way quicker than the case without index.
> >>>>>
> >>>>> btw, quick question- in your presentation, the scale there is seconds or
> >>>>> mill-seconds:)
> >>>>>
> >>>>> - Shengjie
> >>>>>
> >>>>>
> >>>>> On 27 December 2012 15:54, Anoop John <[email protected]> wrote:
> >>>>>
> >>>>>>> how the massive number of get() is going to
> >>>>>>> perform againt the main table
> >>>>>>
> >>>>>> Didnt follow u completely here. There wont be any get() happening.. As the
> >>>>>> exact rowkey in a region we get from the index table, we can seek to the
> >>>>>> exact position and return that row.
> >>>>>>
> >>>>>> -Anoop-
> >>>>>>
> >>>>>> On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> how the massive number of get() is going to
> >>>>>>> perform againt the main table
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> All the best,
> >>>>> Shengjie Min
> >>>>>
> >>>>
> >>>
> >>> --
> >>> All the best,
> >>> Shengjie Min
> >>>
> >>
> >> --
> >> Adrien Mogenet
> >> 06.59.16.64.22
> >> http://www.mogenet.me
> >>

--
Thanks & Regards,
Anil Gupta
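
For anyone reading this thread later, here is a rough sketch of the two mechanisms being compared: a global index table that returns rowkeys which are then fetched from the primary table, versus a per-region index where the query is sent to every region. This is a toy in-memory model in Python, not HBase code. The regions, rowkeys, column names, and function names (build_index, query_global, query_scatter) are all made up for illustration, and the cost of reading the global index table itself is not modeled.

```python
from collections import defaultdict

# Toy model, NOT HBase code: each "region" is a dict of rowkey -> row.
# Rowkeys and columns are hypothetical (claims keyed by company + claim ID).
regions = [
    {"c1-001": {"make": "Volvo", "damage": "front"},
     "c1-002": {"make": "Saab", "damage": "rear"}},
    {"c2-001": {"make": "Volvo", "damage": "rear"},
     "c2-002": {"make": "Volvo", "damage": "front"}},
]


def build_index(rows, column):
    """Inverted index: column value -> sorted list of rowkeys."""
    index = defaultdict(list)
    for rowkey in sorted(rows):
        index[rows[rowkey][column]].append(rowkey)
    return index


def region_of(rowkey):
    """Stand-in for locating the region that holds a rowkey."""
    return next(i for i, region in enumerate(regions) if rowkey in region)


# Mechanism 1: one global index table, then fetch rows from the primary table.
all_rows = {k: v for region in regions for k, v in region.items()}
global_index = build_index(all_rows, "make")


def query_global(value):
    """Return (matching rows, number of primary-table regions contacted)."""
    rowkeys = global_index.get(value, [])
    touched = {region_of(k) for k in rowkeys}
    return [(k, all_rows[k]) for k in rowkeys], len(touched)


# Mechanism 2: each region keeps its own index; the query goes to all regions.
local_indexes = [build_index(region, "make") for region in regions]


def query_scatter(value):
    """Return (matching rows, number of regions contacted), always all of them."""
    results = []
    for region, index in zip(regions, local_indexes):
        results.extend((k, region[k]) for k in index.get(value, []))
    return sorted(results, key=lambda kv: kv[0]), len(regions)
```

In this toy setup a point query such as query_global("Saab") only touches the one region holding the matching row, while query_scatter("Saab") has to ask both regions. For "Volvo" both approaches touch every region and return the same rows, which is roughly the "it depends" case Lars describes.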
