Hi Henning,

Phoenix maintains a global index. It is essentially maintaining another HBase table for you with a different row key (and a subset of your data table columns that are "covered"). When an index is used by Phoenix, it is *exactly* like querying a data table (that's what Phoenix does - it ends up issuing a Phoenix query against a Phoenix table that happens to be an index table).
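To make the description above concrete, here is a minimal toy sketch of a global index kept as a second table with a different row key and a covered subset of columns. It is only an illustration: plain Python dicts stand in for HBase tables, and the class name `GlobalIndexedTable`, its methods, and the `(indexed value, data row key)` key layout are all hypothetical, not Phoenix or HBase API.

```python
# Toy model only: dicts stand in for HBase tables. Every name here is
# hypothetical and chosen for illustration; this is not Phoenix code.

class GlobalIndexedTable:
    def __init__(self, indexed_col, covered_cols):
        self.indexed_col = indexed_col
        self.covered_cols = tuple(covered_cols)
        self.data = {}   # data table:  row key -> {column: value}
        self.index = {}  # index table: (indexed value, row key) -> covered columns

    def _index_key(self, row_key, row):
        # Assumed layout: the index row key leads with the indexed value.
        return (row[self.indexed_col], row_key)

    def put(self, row_key, row):
        # Look up the prior state of the row so the stale index row can go.
        prior = self.data.get(row_key)
        if prior is not None:
            self.index.pop(self._index_key(row_key, prior), None)
        self.data[row_key] = dict(row)
        # Write the new index row, carrying only the covered columns.
        covered = {c: row.get(c) for c in self.covered_cols}
        self.index[self._index_key(row_key, row)] = covered

    def lookup(self, value):
        # A query through the index touches only the index table.
        return [(rk, cols)
                for (v, rk), cols in sorted(self.index.items())
                if v == value]


table = GlobalIndexedTable(indexed_col="city", covered_cols=["name"])
table.put("u1", {"city": "Berlin", "name": "Ann", "age": 30})
table.put("u2", {"city": "Berlin", "name": "Bob", "age": 41})
table.put("u1", {"city": "Paris", "name": "Ann", "age": 31})  # moves u1's index row
berlin = table.lookup("Berlin")
paris = table.lookup("Paris")
```

In this sketch a lookup by city never reads `table.data`, which is why only the covered column (`name`) comes back and `age` does not.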
The hit you take for a global index is at write time - we need to look up the prior state of the rows being updated to do the index maintenance. Then we need to do a write to the index table. The upside is that there's no hit at read/query time (we don't yet attempt to join from the index table back to the data table - if a query is using columns that aren't in the index, the index simply won't be used).

More here: https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing

Thanks,
James

On Fri, Jan 3, 2014 at 12:46 PM, Henning Blohm <[email protected]> wrote:

> When scanning in order of an index and you use RLI, it seems, there is no
> alternative but to involve all regions - and essentially this should happen
> in parallel as otherwise you might not get what you wanted. Also, for a
> single Get, it seems (as Lars pointed out in
> https://issues.apache.org/jira/browse/HBASE-2038) that you have to consult
> all regions.
>
> When that parallelism is no problem (small number of servers) it will
> actually help single scan performance (regions can provide their share in
> parallel).
>
> A high number of concurrent client requests leads to the same number of
> requests on all regions and its multiple of connections to be maintained by
> the client.
>
> My assumption is that that will eventually lead to a scalability problem -
> when, say, having 100 region servers or so in place. I was wondering if
> anyone has experience with that.
>
> That will be perfectly acceptable for many use cases that benefit from the
> scan (and hence query) performance more than they suffer from the load
> problem. Other use cases have fewer requirements on scans and query
> flexibility but rather want to preserve the quality that a Get has fixed
> resource usage.
>
> Btw.: I was convinced that Phoenix is keeping indexes on the region level.
> Is that not so?
>
> Thanks,
> Henning
>
> On 03.01.2014 17:57, Anoop John wrote:
>
>> In case of HBase normal scan as we know, regions will be scanned
>> sequentially. Phoenix has parallel scan impls in it. When RLI is used
>> and we make use of the index completely at server side, it is irrespective
>> of client scan ways. Sequential or parallel, using java or any other client
>> layer or using SQL layer like Phoenix, using MR or not... the client side
>> doesn't have to worry about this, the usage will be fully at server end.
>>
>> Yes when parallel scan is done on regions, RLI might perform much better.
>>
>> -Anoop-
>>
>> On Fri, Jan 3, 2014 at 7:35 PM, rajeshbabu chintaguntla <[email protected]> wrote:
>>
>>> No, the regions are scanned sequentially.
>>> ________________________________________
>>> From: Asaf Mesika [[email protected]]
>>> Sent: Friday, January 03, 2014 7:26 PM
>>> To: [email protected]
>>> Subject: Re: secondary index feature
>>>
>>> Are the regions scanned in parallel?
>>>
>>> On Friday, January 3, 2014, rajeshbabu chintaguntla wrote:
>>>
>>>> Here are some performance numbers with RLI.
>>>>
>>>> No. of region servers: 4
>>>> Data per region: 2 GB
>>>>
>>>> Regions/RS | Total regions | Blocksize (KB) | No. of rows matching values | Time taken (sec)
>>>> 50         | 200           | 64             | 199                         | 102
>>>> 50         | 200           | 8              | 199                         | 35
>>>> 100        | 400           | 8              | 350                         | 95
>>>> 200        | 800           | 8              | 353                         | 153
>>>>
>>>> Without a secondary index the scan takes hours.
>>>>
>>>> Thanks,
>>>> Rajeshbabu
>>>> ________________________________________
>>>> From: Anoop John [[email protected]]
>>>> Sent: Friday, January 03, 2014 3:22 PM
>>>> To: [email protected]
>>>> Subject: Re: secondary index feature
>>>>
>>>>> Is there any data on how RLI (or in particular Phoenix) query throughput
>>>>> correlates with the number of region servers assuming homogeneously
>>>>> distributed data?
>>>>
>>>> Phoenix is yet to add RLI. Now it is having global indexing only. Correct
>>>> James?
>>>>
>>>> RLI impl from Huawei (HIndex) is having some numbers wrt regions.. But I
>>>> doubt whether it is there for a large number of RSs. Do you have some data,
>>>> Rajesh Babu?
>>>>
>>>> -Anoop-
>>>>
>>>> On Fri, Jan 3, 2014 at 3:11 PM, Henning Blohm <[email protected]> wrote:
>>>>
>>>>> Jesse, James, Lars,
>>>>>
>>>>> after looking around a bit and in particular looking into Phoenix (which I
>>>>> find very interesting), assuming that you want a secondary indexing on
>>>>> HBASE without adding other infrastructure, there seems to be not a lot of
>>>>> choice really: Either go with a region-level (and co-processor based)
>>>>> indexing feature (Phoenix, Huawei, is IHBase dead?) or add an index table
>>>>> to store (index value, entity key) pairs.
>>>>>
>>>>> The main concern I have with region-level indexing (RLI) is that Gets
>>>>> potentially require to visit all regions. Compared to global index tables
>>>>> this seems to flatten the read-scalability curve of the cluster. In our
>>>>> case, we have a large data set (hence HBASE) that will be queried (mostly
>>>>> point-gets via an index) in some linear correlation with its size.
>>>>>
>>>>> Is there any data on how RLI (or in particular Phoenix) query throughput
>>>>> correlates with the number of region servers assuming homogeneously
>>>>> distributed data?
>>>>>
>>>>> Thanks,
>>>>> Henning
>>>>>
>>>>> On 24.12.2013 12:18, Henning Blohm wrote:
>>>>>
>>>>>> All that sounds very promising. I will give it a try and let you know
>>>>>> how things worked out.
>>>>>>
>>>>>> Thanks,
>>>>>> Henning
>>>>>>
>>>>>> On 12/23/2013 08:10 PM, Jesse Yates wrote:
>>>>>>
>>>>>>> The work that James is referencing grew out of the discussions Lars and I
>>>>>>> had (which led to those blog posts).
>>>>>>> The solution we implemented is designed to be generic, as James mentioned
>>>>>>> above, but was written with all the hooks necessary for Phoenix to do some
>>>>>>> really fast updates (or skipping updates in the case where there is no
>>>>>>> change).
>>>>>>>
>>>>>>> You should be able to plug in your own simple index builder (there is an
>>>>>>> example in the phoenix codebase
>>>>>>> <https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example>)
>>>>>>> to get a basic solution which supports the same transactional guarantees as
>>>>>>> HBase (per row) + data guarantees across the index rows. There are more
>>>>>>> details in the presentations James linked.
>>>>>>>
>>>>>>> I'd love to see if your implementation can fit into the framework we wrote
>>>>>>> - we would be happy to work to see if it needs some more hooks or
>>>>>>> modifications - I have a feeling this is pretty much what you guys will
>>>>>>> need
>>>>>>>
>>>>>>> -Jesse
>>>>>>>
>>>>>>> On Mon, Dec 23, 2013 at 10:01 AM, James Taylor <[email protected]> wrote:
>>>>>>>
>>>>>>>> Henning,
>>>>>>>>
>>>>>>>> Jesse Yates wrote the back-end of our global secondary indexing system in
>>>>>>>> Phoenix. He designed it as a separate, pluggable module with no Phoenix
>>>>>>>> dependencies. Here's an overview of the feature:
>>>>>>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing.
>>>>>>>> The section that discusses the data guarantees and failure management might
>>>>>>>> be of interest to you:
>>>>>>>>
>>>>>>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management
>>>>>>>>
>>>>>>>> This presentation also gives a good overview of the pluggability of his
>
> --
> Henning Blohm
>
> *ZFabrik Software KG*
>
> T: +49 6227 3984255
> F: +49 6227 3984254
> M: +49 1781891820
>
> Lammstrasse 2 69190 Walldorf
>
> [email protected] <mailto:[email protected]>
> Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
> ZFabrik <http://www.zfabrik.de>
> Blog <http://www.z2-environment.net/blog>
> Z2-Environment <http://www.z2-environment.eu>
> Z2 Wiki <http://redmine.z2-environment.net>
