Re: secondary index feature

James Taylor Fri, 03 Jan 2014 12:55:15 -0800

Hi Henning,
Phoenix maintains a global index. It is essentially maintaining another
HBase table for you with a different row key (and a subset of your data
table columns that are "covered"). When an index is used by Phoenix, it is
*exactly* like querying a data table (that's what Phoenix does - it ends up
issuing a Phoenix query against a Phoenix table that happens to be an index
table).
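
As a rough illustration of that layout, here is a toy in-memory model in
Python. It is not Phoenix code: the table contents, the choice of
(indexed value, data row key) as the index row key, and the helper names
build_index and scan_index are all assumptions made for the example.

```python
import bisect

# Hypothetical data table: row key -> columns.
data_table = {
    "u1": {"city": "Berlin", "name": "Ada", "age": 36},
    "u2": {"city": "Austin", "name": "Bob", "age": 41},
    "u3": {"city": "Berlin", "name": "Cy", "age": 29},
}

def build_index(table, indexed_col, covered_cols):
    """Build a second 'table' sorted by (indexed value, data row key).

    Each index row carries only the covered columns, not the full data row.
    """
    rows = []
    for row_key, cols in table.items():
        index_key = (cols[indexed_col], row_key)
        rows.append((index_key, {c: cols[c] for c in covered_cols}))
    rows.sort(key=lambda r: r[0])
    return rows

def scan_index(index_rows, value):
    """Range scan over the index rows whose key starts with `value`."""
    keys = [k for k, _ in index_rows]
    # (value,) sorts before every (value, row_key), so this finds the range start.
    start = bisect.bisect_left(keys, (value,))
    out = []
    for key, covered in index_rows[start:]:
        if key[0] != value:
            break
        out.append((key[1], covered))
    return out

city_index = build_index(data_table, "city", ["name"])
print(scan_index(city_index, "Berlin"))
```

The lookup by city is an ordinary sorted-key range scan on the second
table, which is the sense in which querying an index is like querying a
data table.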


The hit you take for a global index is at write time - we need to look up
the prior state of the rows being updated to do the index maintenance. Then
we need to do a write to the index table. The upside is that there's no hit
at read/query time (we don't yet attempt to join from the index table back
to the data table - if a query is using columns that aren't in the index,
it simply won't be used). More here:
https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing
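
The write-time cost and the "covered columns only" rule can be sketched
with a toy in-memory model in Python. This is an assumption-laden
illustration, not Phoenix's implementation: the dictionaries stand in for
HBase tables, and upsert, index_is_usable, INDEXED_COL and COVERED_COLS
are hypothetical names.

```python
data_table = {}    # row key -> {column: value}
index_table = {}   # (indexed value, row key) -> covered columns

INDEXED_COL = "city"        # hypothetical indexed column
COVERED_COLS = ("name",)    # hypothetical covered columns

def upsert(row_key, new_cols):
    """Write a row and keep the index table in step with it."""
    # 1. Look up the prior state of the row (the extra cost at write time).
    prior = data_table.get(row_key)
    merged = {**(prior or {}), **new_cols}
    # 2. Drop the index row that pointed at the old indexed value, if any.
    if prior is not None and INDEXED_COL in prior:
        index_table.pop((prior[INDEXED_COL], row_key), None)
    # 3. Write the data row, then the new index row.
    data_table[row_key] = merged
    if INDEXED_COL in merged:
        index_table[(merged[INDEXED_COL], row_key)] = {
            c: merged.get(c) for c in COVERED_COLS
        }

def index_is_usable(query_cols):
    """With no join back to the data table, the index can serve a query
    only if every referenced column is indexed or covered."""
    return set(query_cols) <= {INDEXED_COL, *COVERED_COLS}

upsert("u1", {"city": "Berlin", "name": "Ada", "age": 36})
upsert("u1", {"city": "Austin"})          # moves the index row
print(index_table)
print(index_is_usable(["city", "name"]), index_is_usable(["city", "age"]))
```

Without step 1 the old ("Berlin", "u1") index row could not be found and
removed, which is why the prior-state lookup is needed on every update.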

Thanks,
James


On Fri, Jan 3, 2014 at 12:46 PM, Henning Blohm <[email protected]> wrote:

> When you scan in index order with RLI, there seems to be no alternative
> but to involve all regions - and this essentially has to happen in
> parallel, as otherwise you might not get what you wanted. Also, for a
> single Get, it seems that you have to consult all regions (as Lars pointed
> out in https://issues.apache.org/jira/browse/HBASE-2038).
>
> When that parallelism is no problem (small number of servers) it will
> actually help single scan performance (regions can provide their share in
> parallel).
>
> A high number of concurrent client requests leads to the same number of
> requests on every region, and to a multiple of that number in connections
> to be maintained by the client.
>
> My assumption is that this will eventually lead to a scalability problem -
> with, say, 100 region servers or so in place. I was wondering whether
> anyone has experience with that.
>
> That will be perfectly acceptable for many use cases that benefit from the
> scan (and hence query) performance more than they suffer from the load
> problem. Other use cases have fewer requirements on scan and query
> flexibility and would rather preserve the property that a Get has fixed
> resource usage.
>
> Btw.: I was convinced that Phoenix keeps indexes at the region level.
> Is that not so?
>
> Thanks,
> Henning
>
>
> On 03.01.2014 17:57, Anoop John wrote:
>
>> In a normal HBase scan, as we know, regions are scanned sequentially.
>> Phoenix has parallel scan implementations. When RLI is used and the index
>> is applied entirely on the server side, it does not matter how the client
>> scans: sequentially or in parallel, using Java or any other client layer,
>> using a SQL layer like Phoenix, using MR or not... None of the client side
>> has to worry about this; the index is used entirely at the server end.
>>
>> Yes, when regions are scanned in parallel, RLI might perform much better.
>>
>> -Anoop-
>>
>> On Fri, Jan 3, 2014 at 7:35 PM, rajeshbabu chintaguntla <
>> [email protected]> wrote:
>>
>>> No, the regions are scanned sequentially.
>>> ________________________________________
>>> From: Asaf Mesika [[email protected]]
>>> Sent: Friday, January 03, 2014 7:26 PM
>>> To: [email protected]
>>> Subject: Re: secondary index feature
>>>
>>> Are the regions scanned in parallel?
>>>
>>> On Friday, January 3, 2014, rajeshbabu chintaguntla wrote:
>>>
>>>> Here are some performance numbers with RLI.
>>>>
>>>> No. of region servers: 4
>>>> Data per region: 2 GB
>>>>
>>>> Regions/RS | Total regions | Block size (KB) | No. of rows matching values | Time taken (sec)
>>>>         50 |           200 |              64 |                         199 |              102
>>>>         50 |           200 |               8 |                         199 |               35
>>>>        100 |           400 |               8 |                         350 |               95
>>>>        200 |           800 |               8 |                         353 |              153
>>>>
>>>> Without the secondary index, the scan takes hours.
>>>>
>>>>
>>>> Thanks,
>>>> Rajeshbabu
>>>> ________________________________________
>>>> From: Anoop John [[email protected]]
>>>> Sent: Friday, January 03, 2014 3:22 PM
>>>> To: [email protected]
>>>> Subject: Re: secondary index feature
>>>>
>>>>> Is there any data on how RLI (or in particular Phoenix) query
>>>>> throughput correlates with the number of region servers assuming
>>>>> homogeneously distributed data?
>>>>
>>>> Phoenix has yet to add RLI; for now it has global indexing only.
>>>> Correct, James?
>>>>
>>>> The RLI implementation from Huawei (HIndex) has some numbers with
>>>> respect to regions, but I doubt whether they cover a large number of
>>>> region servers. Do you have some data, Rajesh Babu?
>>>>
>>>> -Anoop-
>>>>
>>>> On Fri, Jan 3, 2014 at 3:11 PM, Henning Blohm <[email protected]>
>>>> wrote:
>>>>
>>>>> Jesse, James, Lars,
>>>>>
>>>>> after looking around a bit and in particular looking into Phoenix
>>>>> (which I find very interesting), assuming that you want secondary
>>>>> indexing on HBase without adding other infrastructure, there seems to
>>>>> be not a lot of choice really: either go with a region-level (and
>>>>> co-processor based) indexing feature (Phoenix, Huawei, is IHBase
>>>>> dead?) or add an index table to store (index value, entity key) pairs.
>>>>>
>>>>> The main concern I have with region-level indexing (RLI) is that Gets
>>>>> potentially have to visit all regions. Compared to global index
>>>>> tables, this seems to flatten the read-scalability curve of the
>>>>> cluster. In our case, we have a large data set (hence HBase) that will
>>>>> be queried (mostly point-gets via an index) in some linear correlation
>>>>> with its size.
>>>>>
>>>>> Is there any data on how RLI (or in particular Phoenix) query
>>>>> throughput correlates with the number of region servers assuming
>>>>> homogeneously distributed data?
>>>>>
>>>>> Thanks,
>>>>> Henning
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 24.12.2013 12:18, Henning Blohm wrote:
>>>>>
>>>>>> All that sounds very promising. I will give it a try and let you know
>>>>>> how things worked out.
>>>>>>
>>>>>> Thanks,
>>>>>> Henning
>>>>>>
>>>>>> On 12/23/2013 08:10 PM, Jesse Yates wrote:
>>>>>>
>>>>>>> The work that James is referencing grew out of the discussions Lars
>>>>>>> and I had (which led to those blog posts). The solution we
>>>>>>> implemented is designed to be generic, as James mentioned above, but
>>>>>>> was written with all the hooks necessary for Phoenix to do some
>>>>>>> really fast updates (or to skip updates in the case where there is
>>>>>>> no change).
>>>>>>>
>>>>>>> You should be able to plug in your own simple index builder (there is
>>>>>>> an example in the phoenix codebase
>>>>>>> <https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example>)
>>>>>>> to get a basic solution which supports the same transactional
>>>>>>> guarantees as HBase (per row) + data guarantees across the index
>>>>>>> rows. There are more details in the presentations James linked.
>>>>>>>
>>>>>>> I'd love to see if your implementation can fit into the framework we
>>>>>>> wrote - we would be happy to work to see if it needs some more hooks
>>>>>>> or modifications - I have a feeling this is pretty much what you guys
>>>>>>> will need
>>>>>>>
>>>>>>> -Jesse
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 23, 2013 at 10:01 AM, James Taylor <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Henning,
>>>>>>>> Jesse Yates wrote the back-end of our global secondary indexing
>>>>>>>> system in Phoenix. He designed it as a separate, pluggable module
>>>>>>>> with no Phoenix dependencies. Here's an overview of the feature:
>>>>>>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing. The
>>>>>>>> section that discusses the data guarantees and failure management
>>>>>>>> might be of interest to you:
>>>>>>>>
>>>>>>>> https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management
>>>>>>>>
>>>>>>>> This presentation also gives a good overview of the pluggability of
>>>>>>>> his
>>>>
>
> --
> Henning Blohm
>
> *ZFabrik Software KG*
>
> T:      +49 6227 3984255
> F:      +49 6227 3984254
> M:      +49 1781891820
>
> Lammstrasse 2 69190 Walldorf
>
> [email protected] <mailto:[email protected]>
> Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
> ZFabrik <http://www.zfabrik.de>
> Blog <http://www.z2-environment.net/blog>
> Z2-Environment <http://www.z2-environment.eu>
> Z2 Wiki <http://redmine.z2-environment.net>
>
>
