As far as I can see, it is more about using the coprocessor framework in this solution, which helps us a great deal in avoiding unnecessary RPC calls when we go with region-level indexing.
Regards
Ram

On Wed, Jan 9, 2013 at 8:52 AM, Anoop Sam John <[email protected]> wrote:

> Totally agree with Lars. The design came up as per our usage and data
> distribution style etc.
> Also, we were not able to compromise on put performance. That is why the
> region-collocation-based, region-level indexing design came about :)
> Also, as the indexing and the index usage all happen at the server side,
> there is no need for any change in the client, whatever type of client
> you use: Java code, REST APIs or anything else. MR-based parallel scans
> should also be comparably easy, I feel, as absolutely no changes are
> needed at the client side. :)
>
> As Anil said, there will be pros and cons for every approach, and the one
> that suits your usage needs to be adopted. :)
>
> -Anoop-
> ________________________________________
> From: anil gupta [[email protected]]
> Sent: Wednesday, January 09, 2013 6:58 AM
> To: [email protected]; lars hofhansl
> Subject: Re: HBase - Secondary Index
>
> +1 on Lars's comment.
>
> Either the client gets the rowkey from the secondary table and then gets
> the real data from the primary table, ** OR ** it sends the request to all
> the RS (or regions) hosting a region of the primary table.
>
> Anoop is using the latter mechanism. Both mechanisms have their pros and
> cons. IMO, there is no outright winner.
>
> ~Anil Gupta
>
> On Tue, Jan 8, 2013 at 4:30 PM, lars hofhansl <[email protected]> wrote:
>
> > Different use cases.
> >
> > For global point queries you want exactly what you said below.
> > For range scans across many rows you want Anoop's design. As usual, it
> > depends.
> >
> > The tradeoff is bringing a lot of unnecessary data to the client vs. having
> > to contact each region (or at least each region server).
> >
> > -- Lars
> >
> > ________________________________
> > From: Michael Segel <[email protected]>
> > To: [email protected]
> > Sent: Tuesday, January 8, 2013 6:33 AM
> > Subject: Re: HBase - Secondary Index
> >
> > So if you're using an inverted table / index, why on earth are you doing it
> > at the region level?
> >
> > I tried to explain this to others over 6 months ago, and it's not really
> > a good idea.
> >
> > You're overcomplicating this and you will end up creating performance
> > bottlenecks when your secondary index is completely orthogonal to your row
> > key.
> >
> > To give you an example...
> >
> > Suppose you're CCCIS and you have a large database of auto insurance
> > claims that you've acquired over the years from your Pathways product.
> >
> > Your primary key would be a combination of the insurance company's ID and
> > their internal claim ID for the individual claim.
> > Your row would be all of the data associated with that claim.
> >
> > So now let's say you want to find the average cost to repair a front-end
> > collision of an S80 Volvo.
> > The make and model of the car would be orthogonal to the initial key. This
> > means that the result set containing insurance records for front-end
> > collisions of S80 Volvos would most likely be evenly distributed across the
> > cluster's regions.
> >
> > If you used a series of inverted tables, you would be able to use a series
> > of get()s to get the result set from each index and then find their
> > intersections. (Note that you could also put them in sort order so that the
> > intersections would be fairly straightforward to find.)
> >
> > Doing this at the region level isn't so simple.
> >
> > So I have to ask again: why go through this and overcomplicate things?
> >
> > Just saying...
> >
> > On Jan 7, 2013, at 7:49 AM, Anoop Sam John <[email protected]> wrote:
> >
> > > Hi,
> > > It is an inverted index based on column(s) value(s).
> > > It will be region-wise indexing.
> > > It can work whether someone knows the rowkey range or NOT.
> > >
> > > -Anoop-
> > > ________________________________________
> > > From: Mohit Anchlia [[email protected]]
> > > Sent: Monday, January 07, 2013 9:47 AM
> > > To: [email protected]
> > > Subject: Re: HBase - Secondary Index
> > >
> > > Hi Anoop,
> > >
> > > Am I correct in understanding that this indexing mechanism is only
> > > applicable when you know the row key? It's not an inverted index truly
> > > based on the column value.
> > >
> > > Mohit
> > >
> > > On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <[email protected]> wrote:
> > >
> > >> Hi Adrien
> > >> We are handling the consistency between the main table and the
> > >> index table, and the rollback mentioned below, using the CP hooks. The
> > >> current hooks were not enough for those though.. I am in the process of
> > >> trying to contribute those new hooks, core changes etc. now... Once all are
> > >> done I will be able to explain in detail..
> > >>
> > >> -Anoop-
> > >> ________________________________________
> > >> From: Adrien Mogenet [[email protected]]
> > >> Sent: Monday, January 07, 2013 2:00 AM
> > >> To: [email protected]
> > >> Subject: Re: HBase - Secondary Index
> > >>
> > >> Nice topic, perhaps one of the most important for 2013 :-)
> > >> I still don't get how you're ensuring consistency between the index table
> > >> and the main table without an external component (such as
> > >> bookkeeper/zookeeper).
> > >> What's the exact write path in your situation when inserting data?
> > >> (WAL/RegionObserver, pre/post put/WALEdit...)
> > >>
> > >> The underlying question is about how you're ensuring that the WALEdits in
> > >> the index and main tables are perfectly synced, and how you're able to roll
> > >> back in case of an issue in both WALs?
> > >>
> > >> On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min <[email protected]> wrote:
> > >>
> > >>>> Yes, as you say, when the number of rows to be returned is becoming more
> > >>>> and more, the latency will be becoming more. Seeks within an HFile block
> > >>>> are a somewhat expensive op now. (Not much, but still.) The new prefix
> > >>>> trie encoding will be a huge bonus here. There the seeks will be flying..
> > >>>> [Ted also presented this in the Hadoop China] Thanks to Matt... :) I am
> > >>>> trying to measure the scan performance with this new encoding. Trying to
> > >>>> backport a simple patch for the 94 version just for testing... Yes, when
> > >>>> the number of results to be returned is more and more, any index will
> > >>>> become less performing as per my study :)
> > >>>
> > >>> Yes, you are right, I guess it's just a drawback of any index approach.
> > >>> Thanks for the explanation.
> > >>>
> > >>> Shengjie
> > >>>
> > >>> On 28 December 2012 04:14, Anoop Sam John <[email protected]> wrote:
> > >>>
> > >>>>> Do you have a link to that presentation?
> > >>>>
> > >>>> http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
> > >>>>
> > >>>> -Anoop-
> > >>>>
> > >>>> ________________________________________
> > >>>> From: Mohit Anchlia [[email protected]]
> > >>>> Sent: Friday, December 28, 2012 9:12 AM
> > >>>> To: [email protected]
> > >>>> Subject: Re: HBase - Secondary Index
> > >>>>
> > >>>> On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <[email protected]> wrote:
> > >>>>
> > >>>>> Yes, as you say, when the number of rows to be returned is becoming more
> > >>>>> and more, the latency will be becoming more. Seeks within an HFile block
> > >>>>> are a somewhat expensive op now. (Not much, but still.) The new prefix
> > >>>>> trie encoding will be a huge bonus here. There the seeks will be flying..
> > >>>>> [Ted also presented this in the Hadoop China] Thanks to Matt...
> > >>>>> :) I am trying to
> > >>>>> measure the scan performance with this new encoding. Trying to backport a
> > >>>>> simple patch for the 94 version just for testing... Yes, when the number of
> > >>>>> results to be returned is more and more, any index will become less
> > >>>>> performing as per my study :)
> > >>>>>
> > >>>>> Do you have a link to that presentation?
> > >>>>
> > >>>>>> btw, quick question: in your presentation, is the scale there seconds or
> > >>>>>> milliseconds? :)
> > >>>>>
> > >>>>> It is seconds. Don't consider the exact values. What is important is the %
> > >>>>> increase in latency :) Those were not high-end machines.
> > >>>>>
> > >>>>> -Anoop-
> > >>>>> ________________________________________
> > >>>>> From: Shengjie Min [[email protected]]
> > >>>>> Sent: Thursday, December 27, 2012 9:59 PM
> > >>>>> To: [email protected]
> > >>>>> Subject: Re: HBase - Secondary Index
> > >>>>>
> > >>>>>> Didn't follow you completely here. There won't be any get() happening.. As
> > >>>>>> we get the exact rowkey in a region from the index table, we can seek to
> > >>>>>> the exact position and return that row.
> > >>>>>
> > >>>>> Sorry, I misused "get()" here; I meant seeking. Yes, if it's just a
> > >>>>> small number of rows returned, this works perfectly. As you said, you will
> > >>>>> get the exact rowkey positions per region and simply seek to them. I was
> > >>>>> trying to work out the case when the number of result rows increases
> > >>>>> massively. Like in Anil's case, he wants to do a scan query against the
> > >>>>> secondary index (timestamp): "select all rows from timestamp1 to timestamp2"
> > >>>>> with no customerId provided. During that time period, he might have a big
> > >>>>> chunk of rows from different customerIds.
> > >>>>> The index table returns a lot of
> > >>>>> rowkey positions for different customerIds (I believe they are scattered
> > >>>>> across different regions), so you end up seeking to all the different
> > >>>>> positions in different regions and returning all the rows needed. According
> > >>>>> to your presentation, page 14 - Performance Test Results (Scan), without
> > >>>>> the index it's a linear increase as the number of result rows increases. On
> > >>>>> the other hand, with the index, the time spent climbs way quicker than in
> > >>>>> the case without the index.
> > >>>>>
> > >>>>> btw, quick question: in your presentation, is the scale there seconds or
> > >>>>> milliseconds? :)
> > >>>>>
> > >>>>> - Shengjie
> > >>>>>
> > >>>>> On 27 December 2012 15:54, Anoop John <[email protected]> wrote:
> > >>>>>
> > >>>>>>> how the massive number of get() is going to
> > >>>>>>> perform against the main table
> > >>>>>>
> > >>>>>> Didn't follow you completely here. There won't be any get() happening.. As
> > >>>>>> we get the exact rowkey in a region from the index table, we can seek to
> > >>>>>> the exact position and return that row.
> > >>>>>>
> > >>>>>> -Anoop-
> > >>>>>>
> > >>>>>> On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> how the massive number of get() is going to
> > >>>>>>> perform against the main table
> > >>>>>
> > >>>>> --
> > >>>>> All the best,
> > >>>>> Shengjie Min
> > >>>
> > >>> --
> > >>> All the best,
> > >>> Shengjie Min
> > >>
> > >> --
> > >> Adrien Mogenet
> > >> 06.59.16.64.22
> > >> http://www.mogenet.me
>
> --
> Thanks & Regards,
> Anil Gupta
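
For reference, here is a minimal sketch of the inverted-table lookup Michael describes in the quoted thread: one lookup per index, then a merge-style intersection of the sorted row-key lists. It is only an illustration under stated assumptions. The two dictionaries are hypothetical in-memory stand-ins for index tables (in HBase each lookup would be a get() against an index table), and the row keys, column values and function names are made up.

```python
from typing import Dict, List

# Hypothetical stand-ins for two inverted index tables (assumption, not HBase API).
# Each maps an indexed column value to a SORTED list of primary-table row keys.
MAKE_MODEL_INDEX: Dict[str, List[str]] = {
    "Volvo S80": ["acme|0007", "acme|0042", "zenith|0003", "zenith|0019"],
}
DAMAGE_INDEX: Dict[str, List[str]] = {
    "front-end": ["acme|0042", "beta|0011", "zenith|0019"],
}


def intersect_sorted(a: List[str], b: List[str]) -> List[str]:
    """Intersect two sorted key lists in one pass, like the merge step of merge sort."""
    i = j = 0
    out: List[str] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear in b any more, skip it
        else:
            j += 1  # b[j] cannot appear in a any more, skip it
    return out


def matching_row_keys(make_model: str, damage: str) -> List[str]:
    """One lookup per index, then intersect the two sorted result sets."""
    by_model = MAKE_MODEL_INDEX.get(make_model, [])
    by_damage = DAMAGE_INDEX.get(damage, [])
    return intersect_sorted(by_model, by_damage)


if __name__ == "__main__":
    # Row keys of claims that are both "Volvo S80" and "front-end".
    print(matching_row_keys("Volvo S80", "front-end"))
```

Because both lists are already in sort order, the intersection costs a single linear pass over the two result sets. The rows themselves would still have to be fetched from the primary table afterwards, which is the client-side cost Lars mentions.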
