Can you provide a use case?
Sent from a remote device. Please excuse any typos...

Mike Segel

On Jan 8, 2013, at 6:30 PM, lars hofhansl <[email protected]> wrote:

> Different use cases.
>
>
> For global point queries you want exactly what you said below.
> For range scans across many rows you want Anoop's design. As usually it
> depends.
>
>
> The tradeoff is bringing a lot of unnecessary data to the client vs having to
> contact each region (or at least each region server).
>
>
> -- Lars
>
>
>
> ________________________________
> From: Michael Segel <[email protected]>
> To: [email protected]
> Sent: Tuesday, January 8, 2013 6:33 AM
> Subject: Re: HBase - Secondary Index
>
> So if you're using an inverted table / index why on earth are you doing it at
> the region level?
>
> I've tried to explain this to others over 6 months ago and its not really a
> good idea.
>
> You're over complicating this and you will end up creating performance
> bottlenecks when your secondary index is completely orthogonal to your row
> key.
>
> To give you an example...
>
> Suppose you're CCCIS and you have a large database of auto insurance claims
> that you've acquired over the years from your Pathways product.
>
> Your primary key would be a combination of the Insurance Company's ID and
> their internal claim ID for the individual claim.
> Your row would be all of the data associated to that claim.
>
> So now lets say you want to find the average cost to repair a front end
> collision of an S80 Volvo.
> The make and model of the car would be orthogonal to the initial key. This
> means that the result set containing insurance records for Front End
> collisions of S80 Volvos would be most likely evenly distributed across the
> cluster's regions.
>
> If you used a series of inverted tables, you would be able to use a series of
> get()s to get the result set from each index and then find their
> intersections.
> (Note that you could also put them in sort order so that the
> intersections would be fairly straight forward to find.
>
> Doing this at the region level isn't so simple.
>
> So I have to again ask why go through and over complicate things?
>
> Just saying...
>
> On Jan 7, 2013, at 7:49 AM, Anoop Sam John <[email protected]> wrote:
>
>> Hi,
>> It is inverted index based on column(s) value(s)
>> It will be region wise indexing. Can work when some one knows the rowkey
>> range or NOT.
>>
>> -Anoop-
>> ________________________________________
>> From: Mohit Anchlia [[email protected]]
>> Sent: Monday, January 07, 2013 9:47 AM
>> To: [email protected]
>> Subject: Re: HBase - Secondary Index
>>
>> Hi Anoop,
>>
>> Am I correct in understanding that this indexing mechanism is only
>> applicable when you know the row key? It's not an inverted index truly
>> based on the column value.
>>
>> Mohit
>> On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <[email protected]> wrote:
>>
>>> Hi Adrien
>>> We are making the consistency btw the main table and
>>> index table and the roll back mentioned below etc using the CP hooks. The
>>> current hooks were not enough for those though.. I am in the process of
>>> trying to contribute those new hooks, core changes etc now... Once all are
>>> done I will be able to explain in details..
>>>
>>> -Anoop-
>>> ________________________________________
>>> From: Adrien Mogenet [[email protected]]
>>> Sent: Monday, January 07, 2013 2:00 AM
>>> To: [email protected]
>>> Subject: Re: HBase - Secondary Index
>>>
>>> Nice topic, perhaps one of the most important for 2013 :-)
>>> I still don't get how you're ensuring consistency between index table and
>>> main table, without an external component (such as bookkeeper/zookeeper).
>>> What's the exact write path in your situation when inserting data ?
>>> (WAL/RegionObserver, pre/post put/WALedit...)
>>>
>>> The underlying question is about how you're ensuring that WALEdit in Index
>>> and Main tables are perfectly sync'ed, and how you 're able to rollback in
>>> case of issue in both WAL ?
>>>
>>>
>>> On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min <[email protected]>
>>> wrote:
>>>
>>>>> Yes as you say when the no of rows to be returned is becoming more and
>>>>> more the latency will be becoming more. seeks within an HFile block is
>>>>> some what expensive op now. (Not much but still) The new encoding prefix
>>>>> trie will be a huge bonus here. There the seeks will be flying.. [Ted also
>>>>> presented this in the Hadoop China] Thanks to Matt... :) I am trying to
>>>>> measure the scan performance with this new encoding . Trying to back port
>>>>> a simple patch for 94 version just for testing... Yes when the no of
>>>>> results to be returned is more and more any index will become less
>>>>> performing as per my study :)
>>>>
>>>> yes, you are right, I guess it's just a drawback of any index approach.
>>>> Thanks for the explanation.
>>>>
>>>> Shengjie
>>>>
>>>> On 28 December 2012 04:14, Anoop Sam John <[email protected]> wrote:
>>>>
>>>>>> Do you have link to that presentation?
>>>>>
>>>>> http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
>>>>>
>>>>> -Anoop-
>>>>>
>>>>> ________________________________________
>>>>> From: Mohit Anchlia [[email protected]]
>>>>> Sent: Friday, December 28, 2012 9:12 AM
>>>>> To: [email protected]
>>>>> Subject: Re: HBase - Secondary Index
>>>>>
>>>>> On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Yes as you say when the no of rows to be returned is becoming more and
>>>>>> more the latency will be becoming more. seeks within an HFile block is
>>>>>> some what expensive op now. (Not much but still) The new encoding prefix
>>>>>> trie will be a huge bonus here. There the seeks will be flying..
>>>>>> [Ted also presented this in the Hadoop China] Thanks to Matt... :) I am
>>>>>> trying to measure the scan performance with this new encoding . Trying
>>>>>> to back port a simple patch for 94 version just for testing... Yes when
>>>>>> the no of results to be returned is more and more any index will become
>>>>>> less performing as per my study :)
>>>>>>
>>>>>> Do you have link to that presentation?
>>>>>
>>>>>
>>>>>>> btw, quick question- in your presentation, the scale there is seconds or
>>>>>>> mill-seconds:)
>>>>>>
>>>>>> It is seconds. Dont consider the exact values. What is the % of increase
>>>>>> in latency is important :) Those were not high end machines.
>>>>>>
>>>>>> -Anoop-
>>>>>> ________________________________________
>>>>>> From: Shengjie Min [[email protected]]
>>>>>> Sent: Thursday, December 27, 2012 9:59 PM
>>>>>> To: [email protected]
>>>>>> Subject: Re: HBase - Secondary Index
>>>>>>
>>>>>>> Didnt follow u completely here. There wont be any get() happening.. As the
>>>>>>> exact rowkey in a region we get from the index table, we can seek to the
>>>>>>> exact position and return that row.
>>>>>>
>>>>>> Sorry, When I misused "get()" here, I meant seeking. Yes, if it's just
>>>>>> small number of rows returned, this works perfect. As you said you will get
>>>>>> the exact rowkey positions per region, and simply seek them. I was trying
>>>>>> to work out the case that when the number of result rows increases
>>>>>> massively. Like in Anil's case, he wants to do a scan query against the
>>>>>> 2ndary index(timestamp): "select all rows from timestamp1 to timestamp2"
>>>>>> given no customerId provided. During that time period, he might have a big
>>>>>> chunk of rows from different customerIds.
>>>>>> The index table returns a lot of
>>>>>> rowkey positions for different customerIds (I believe they are scattered in
>>>>>> different regions), then you end up seeking all different positions in
>>>>>> different regions and return all the rows needed. According to your
>>>>>> presentation page14 - Performance Test Results (Scan), without index, it's
>>>>>> a linear increase as result rows # increases. on the other hand, with
>>>>>> index, time spent climbs up way quicker than the case without index.
>>>>>>
>>>>>> btw, quick question- in your presentation, the scale there is seconds or
>>>>>> mill-seconds:)
>>>>>>
>>>>>> - Shengjie
>>>>>>
>>>>>>
>>>>>> On 27 December 2012 15:54, Anoop John <[email protected]> wrote:
>>>>>>
>>>>>>>> how the massive number of get() is going to
>>>>>>>> perform againt the main table
>>>>>>>
>>>>>>> Didnt follow u completely here. There wont be any get() happening.. As the
>>>>>>> exact rowkey in a region we get from the index table, we can seek to the
>>>>>>> exact position and return that row.
>>>>>>>
>>>>>>> -Anoop-
>>>>>>>
>>>>>>> On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> how the massive number of get() is going to
>>>>>>>> perform againt the main table
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> All the best,
>>>>>> Shengjie Min
>>>>
>>>>
>>>>
>>>> --
>>>> All the best,
>>>> Shengjie Min
>>>
>>>
>>>
>>> --
>>> Adrien Mogenet
>>> 06.59.16.64.22
>>> http://www.mogenet.me
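
For reference, the inverted-table approach described in the thread (one lookup per index, then intersect the sorted row keys) could look roughly like the sketch below. It is only an illustration in plain Python, not HBase client code: the dictionaries stand in for index tables and the main table, and every name, row key and cost value is made up for the example. The row key layout (company ID plus claim ID) follows the insurance example above.

```python
from typing import Dict, List, Optional

# Hypothetical stand-ins for two inverted tables. Each maps an indexed
# column value to the SORTED main-table row keys that contain it. In HBase,
# each lookup below would correspond to one get() against an index table.
MODEL_INDEX: Dict[str, List[str]] = {
    "Volvo S80": ["acme|0007", "acme|0042", "zeta|0003", "zeta|0019"],
}
DAMAGE_INDEX: Dict[str, List[str]] = {
    "front end": ["acme|0042", "beta|0011", "zeta|0003"],
}

# Hypothetical main table: row key "companyId|claimId" -> repair cost.
CLAIMS: Dict[str, float] = {
    "acme|0007": 900.0,
    "acme|0042": 2400.0,
    "beta|0011": 1800.0,
    "zeta|0003": 3000.0,
    "zeta|0019": 650.0,
}


def intersect_sorted(a: List[str], b: List[str]) -> List[str]:
    """Merge-style intersection of two sorted row-key lists."""
    out: List[str] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def average_repair_cost(model: str, damage: str) -> Optional[float]:
    """Look up both indexes, intersect the keys, then read the matching rows."""
    keys = intersect_sorted(MODEL_INDEX.get(model, []),
                            DAMAGE_INDEX.get(damage, []))
    if not keys:
        return None
    return sum(CLAIMS[k] for k in keys) / len(keys)


if __name__ == "__main__":
    print(average_repair_cost("Volvo S80", "front end"))
```

Keeping each index entry in sort order is what makes the intersection a single linear pass. Only the row keys that survive the intersection need to be read from the main table.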
