What would be of utmost interest is the proportional difference in time taken as the number of RSs increases (keeping the number of rows matching the values constant).
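As a rough illustration of that kind of comparison, the sketch below normalizes the timings from the quoted message below by total region count and by matching rows. It uses only the three rows reported for the 8 KB block size; the function and field names are made up for illustration. Since all runs used 4 region servers, this shows scaling with region count only, not with the number of RSs.

```python
# Rows reported in the quoted message for block size 8 KB on 4 region servers:
# (regions per RS, total regions, rows matching values, time taken in seconds)
runs = [
    (50, 200, 199, 35),
    (100, 400, 350, 95),
    (200, 800, 353, 153),
]

def normalize(runs):
    """Return descriptive ratios per run; plain arithmetic, no model behind it."""
    out = []
    for regions_per_rs, total_regions, rows, seconds in runs:
        out.append({
            "total_regions": total_regions,
            "sec_per_region": seconds / total_regions,
            "sec_per_matching_row": seconds / rows,
        })
    return out

for r in normalize(runs):
    print(f"{r['total_regions']:>4} regions: "
          f"{r['sec_per_region']:.3f} s/region, "
          f"{r['sec_per_matching_row']:.3f} s/matching row")
```

The matching-row count differs between the runs, which is one reason a comparison with that count held constant would be easier to interpret.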
-Anoop-

On Fri, Jan 3, 2014 at 3:49 PM, rajeshbabu chintaguntla <[email protected]> wrote:

> Here are some performance numbers with RLI.
>
> No Region servers : 4
> Data per region : 2 GB
>
> Regions/RS | Total regions | Blocksize (KB) | No# rows matching values | Time taken (sec)
> 50         | 200           | 64             | 199                      | 102
> 50         | 200           | 8              | 199                      | 35
> 100        | 400           | 8              | 350                      | 95
> 200        | 800           | 8              | 353                      | 153
>
> Without secondary index scan is taking in hours.
>
> Thanks,
> Rajeshbabu
> ________________________________________
> From: Anoop John [[email protected]]
> Sent: Friday, January 03, 2014 3:22 PM
> To: [email protected]
> Subject: Re: secondary index feature
>
> > Is there any data on how RLI (or in particular Phoenix) query throughput
> > correlates with the number of region servers assuming homogeneously
> > distributed data?
>
> Phoenix is yet to add RLI. Now it is having global indexing only. Correct
> James?
>
> RLI impl from Huawei (HIndex) is having some numbers wrt regions.. But I
> doubt whether it is there large no# RSs. Do you have some data Rajesh Babu?
>
> -Anoop-
>
> On Fri, Jan 3, 2014 at 3:11 PM, Henning Blohm <[email protected]> wrote:
>
> > Jesse, James, Lars,
> >
> > after looking around a bit and in particular looking into Phoenix (which I
> > find very interesting), assuming that you want a secondary indexing on
> > HBASE without adding other infrastructure, there seems to be not a lot of
> > choice really: Either go with a region-level (and co-processor based)
> > indexing feature (Phoenix, Huawei, is IHBase dead?) or add an index table
> > to store (index value, entity key) pairs.
> >
> > The main concern I have with region-level indexing (RLI) is that Gets
> > potentially require to visit all regions. Compared to global index tables
> > this seems to flatten the read-scalability curve of the cluster. In our
> > case, we have a large data set (hence HBASE) that will be queried (mostly
> > point-gets via an index) in some linear correlation with its size.
> >
> > Is there any data on how RLI (or in particular Phoenix) query throughput
> > correlates with the number of region servers assuming homogeneously
> > distributed data?
> >
> > Thanks,
> > Henning
> >
> > On 24.12.2013 12:18, Henning Blohm wrote:
> >
> >> All that sounds very promising. I will give it a try and let you know
> >> how things worked out.
> >>
> >> Thanks,
> >> Henning
> >>
> >> On 12/23/2013 08:10 PM, Jesse Yates wrote:
> >>
> >>> The work that James is referencing grew out of the discussions Lars and I
> >>> had (which lead to those blog posts). The solution we implement is designed
> >>> to be generic, as James mentioned above, but was written with all the hooks
> >>> necessary for Phoenix to do some really fast updates (or skipping updates
> >>> in the case where there is no change).
> >>>
> >>> You should be able to plug in your own simple index builder (there is
> >>> an example in the phoenix codebase
> >>> <https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example>)
> >>> to basic solution which supports the same transactional guarantees as HBase
> >>> (per row) + data guarantees across the index rows. There are more details
> >>> in the presentations James linked.
> >>>
> >>> I'd love you see if your implementation can fit into the framework we wrote
> >>> - we would be happy to work to see if it needs some more hooks or
> >>> modifications - I have a feeling this is pretty much what you guys will need
> >>>
> >>> -Jesse
> >>>
> >>> On Mon, Dec 23, 2013 at 10:01 AM, James Taylor <[email protected]> wrote:
> >>>
> >>>> Henning,
> >>>> Jesse Yates wrote the back-end of our global secondary indexing system in
> >>>> Phoenix. He designed it as a separate, pluggable module with no Phoenix
> >>>> dependencies. Here's an overview of the feature:
> >>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing.
> >>>> The section that discusses the data guarantees and failure management might
> >>>> be of interest to you:
> >>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management
> >>>>
> >>>> This presentation also gives a good overview of the pluggability of his
> >>>> implementation:
> >>>> http://files.meetup.com/1350427/PhoenixIndexing-SF-HUG_09-26-13.pptx
> >>>>
> >>>> Thanks,
> >>>> James
> >>>>
> >>>> On Mon, Dec 23, 2013 at 3:47 AM, Henning Blohm <[email protected]> wrote:
> >>>>
> >>>>> Lars, that is exactly why I am hesitant to use one the core level generic
> >>>>> approaches (apart from having difficulties to identify the still active
> >>>>> projects): I have doubts I can sufficiently explain to myself when and
> >>>>> where they fail.
> >>>>>
> >>>>> With "toolbox approach" I meant to say that turning entity data into
> >>>>> index data is not done generically but rather involving domain specific
> >>>>> application code that
> >>>>>
> >>>>> - indicates what makes an index key given an entity
> >>>>> - indicates whether an index entry is still valid given an entity
> >>>>>
> >>>>> That code is also used during the index rebuild and trimming (an M/R Job)
> >>>>>
> >>>>> So validating whether an index entry is valid means to load the entity
> >>>>> pointed to and - before considering it a valid result - validating whether
> >>>>> values of the entity still match with the index.
> >>>>>
> >>>>> The entity is written last, hence when the client dies halfway through
> >>>>> the update you may get stale index entries but nothing else should break.
> >>>>>
> >>>>> For scanning along the index, we are using a chunk iterator that is, we
> >>>>> read n index entries ahead and then do point look ups for the entities.
> >>>>> How would you avoid point-gets when scanning via an index (as most likely,
> >>>>> entities are ordered independently from the index - hence the index)?
> >>>>>
> >>>>> Something really important to note is that there is no intention to build
> >>>>> a completely generic solution, in particular not (this time - unlike the
> >>>>> other post of mine you responded to) taking row versioning into account.
> >>>>> Instead, row time stamps are used to delete stale entries (old entries
> >>>>> after an index rebuild).
> >>>>>
> >>>>> Thanks a lot for your blog pointers. Haven't had time to study in depth
> >>>>> but at first glance there is lot of overlap of what you are proposing and
> >>>>> what I ended up doing considering the first post.
> >>>>>
> >>>>> On the second post: Indeed I have not worried too much about
> >>>>> transactional isolation of updates. If index update and entity update use
> >>>>> the same HBase time stamp, the result should at least be consistent, right?
> >>>>>
> >>>>> Btw. in no way am I claiming originality of my thoughts - in particular I
> >>>>> read http://jyates.github.io/2012/07/09/consistent-enough-secondary-indexes.html
> >>>>> a while back.
> >>>>>
> >>>>> Thanks,
> >>>>> Henning
> >>>>>
> >>>>> Ps.: I might write about this discussion later in my blog
> >>>>>
> >>>>> On 22.12.2013 23:37, lars hofhansl wrote:
> >>>>>
> >>>>>> The devil is often in the details. On the surface it looks simple.
> >>>>>>
> >>>>>> How specifically are the stale indexes ignored? Are the guaranteed to be
> >>>>>> no races?
> >>>>>> Is deletion handled correctly? Does it work with multiple versions?
> >>>>>> What happens when the client dies 1/2 way through an update?
> >>>>>> It's easy to do eventually consistent indexes. Truly consistent indexes
> >>>>>> without transactions are tricky.
> >>>>>>
> >>>>>> Also, scanning an index and then doing point-gets against a main table
> >>>>>> is slow (unless the index is very selective. The Phoenix team measured that
> >>>>>> there is only an advantage if the index filters out 98-99% of the data).
> >>>>>> So then one would revert to covered indexes and suddenly is not so easy
> >>>>>> to detect stale index entries.
> >>>>>>
> >>>>>> I blogged about these issues here:
> >>>>>> http://hadoop-hbase.blogspot.com/2012/10/musings-on-secondary-indexes.html
> >>>>>> http://hadoop-hbase.blogspot.com/2012/10/secondary-indexes-part-ii.html
> >>>>>>
> >>>>>> Phoenix has a (pretty involved) solution now that works around the fact
> >>>>>> that HBase has no transactions.
> >>>>>>
> >>>>>> -- Lars
> >>>>>>
> >>>>>> ________________________________
> >>>>>> From: Henning Blohm <[email protected]>
> >>>>>> To: user <[email protected]>
> >>>>>> Sent: Sunday, December 22, 2013 2:11 AM
> >>>>>> Subject: secondary index feature
> >>>>>>
> >>>>>> Lately we have added a secondary index feature to a persistence tier
> >>>>>> over HBASE. Essentially we implemented what is described as "Dual-Write
> >>>>>> Secondary Index" in http://hbase.apache.org/book/secondary.indexes.html.
> >>>>>>
> >>>>>> I.e. while updating an entity, actually before writing the actual
> >>>>>> update, indexes are updated. Lookup via the index ignores stale entries.
> >>>>>> A recurring rebuild and clean out of stale entries takes care the
> >>>>>> indexes are trimmed and accurate.
> >>>>>>
> >>>>>> None of this was terribly complex to implement. In fact, it seemed like
> >>>>>> something you could do generically, maybe not on the HBASE level itself,
> >>>>>> but as a toolbox / utility style library.
> >>>>>>
> >>>>>> Is anybody on the list aware of anything useful already existing in that
> >>>>>> space?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Henning Blohm
> >>>>>>
> >>>>>> *ZFabrik Software KG*
> >>>>>>
> >>>>>> T: +49 6227 3984255
> >>>>>> F: +49 6227 3984254
> >>>>>> M: +49 1781891820
> >>>>>>
> >>>>>> Lammstrasse 2 69190 Walldorf
> >>>>>>
> >>>>>> [email protected] <mailto:[email protected]>
> >>>>>> Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
> >>>>>> ZFabrik <http://www.zfabrik.de>
> >>>>>> Blog <http://www.z2-environment.net/blog>
> >>>>>> Z2-Environment <http://www.z2-environment.eu>
> >>>>>> Z2 Wiki <http://redmine.z2-environment.net>
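
For readers following the thread, the dual-write approach Henning describes (index entry written before the entity, stale entries filtered on lookup, index scanned in chunks followed by point lookups) can be sketched roughly as below. This is a minimal in-memory illustration, not his implementation: plain Python containers stand in for the HBase tables, no HBase API is used, and every name (`put`, `lookup`, `index_key`, `is_valid`, the `city` field) is hypothetical. The recurring rebuild and trimming job is not modeled.

```python
from itertools import islice

# In-memory stand-ins for the entity table and the index table (not HBase APIs).
entities = {}   # entity key -> entity dict
index = set()   # (index value, entity key) pairs

def index_key(entity):
    """Domain-specific: what makes an index key given an entity."""
    return entity["city"]

def is_valid(index_value, entity):
    """Domain-specific: does the index entry still match the entity?"""
    return entity is not None and index_key(entity) == index_value

def put(key, entity):
    # Index entry first, entity last: a crash in between leaves at worst
    # a stale index entry, which lookups filter out.
    index.add((index_key(entity), key))
    entities[key] = entity

def lookup(index_value, chunk_size=2):
    """Read index entries in chunks of chunk_size, then point-get each entity."""
    matches = iter(sorted(e for e in index if e[0] == index_value))
    while True:
        chunk = list(islice(matches, chunk_size))
        if not chunk:
            return
        for value, key in chunk:
            entity = entities.get(key)      # point lookup
            if is_valid(value, entity):     # skip stale entries
                yield key, entity

put("u1", {"city": "Walldorf"})
put("u2", {"city": "Walldorf"})
put("u1", {"city": "Berlin"})     # leaves a stale ("Walldorf", "u1") entry
index.add(("Walldorf", "u3"))     # simulates a client dying before the entity write
print(list(lookup("Walldorf")))
```

Only `u2` is returned for "Walldorf": the entry for `u1` no longer matches its entity, and `u3` has an index entry but no entity. Each candidate still costs one point lookup, which is the overhead Lars raises for indexes that are not very selective.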

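Henning's concern about the read-scalability curve can be made concrete with a toy cost model. Everything here is an assumption for illustration, not a measurement from the thread: each region server is taken to handle a fixed number of region probes per second, a region-level index lookup is assumed to probe every region, and a global index lookup is assumed to cost two probes (one index read plus one entity point-get). The function name and the capacity figure of 1000 are hypothetical; 50 regions per server is borrowed from the quoted test setup.

```python
def lookups_per_sec(region_servers, regions_per_rs, probes_per_sec_per_rs, scheme):
    """Upper bound on indexed point lookups per second under a toy cost model."""
    cluster_capacity = region_servers * probes_per_sec_per_rs
    if scheme == "region_level":
        # Assumption: the lookup has to probe every region of the table.
        probes_per_lookup = region_servers * regions_per_rs
    elif scheme == "global":
        # Assumption: one index-table read plus one entity point-get.
        probes_per_lookup = 2
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return cluster_capacity / probes_per_lookup

for n in (4, 8, 16):
    rli = lookups_per_sec(n, 50, 1000, "region_level")
    glob = lookups_per_sec(n, 50, 1000, "global")
    print(f"{n:>2} RS: region-level {rli:.0f}/s, global {glob:.0f}/s")
```

Under these assumptions the region-level figure stays flat as servers are added while the global figure grows linearly. Whether real clusters behave this way is exactly the open question in the thread.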