Re: secondary index feature

Henning Blohm Fri, 03 Jan 2014 13:12:09 -0800

Hi James,

this is a little embarrassing... I even browsed through the code and read it as implementing a region level index.

But now at least I get the restrictions mentioned for using the covered indexes.


Thanks for clarifying. Guess I need to browse the code a little harder ;-)

Henning

On 03.01.2014 21:53, James Taylor wrote:

Hi Henning,
Phoenix maintains a global index. It is essentially maintaining another
HBase table for you with a different row key (and a subset of your data
table columns that are "covered"). When an index is used by Phoenix, it is
*exactly* like querying a data table (that's what Phoenix does - it ends up
issuing a Phoenix query against a Phoenix table that happens to be an index
table).

The hit you take for a global index is at write time - we need to look up
the prior state of the rows being updated to do the index maintenance. Then
we need to do a write to the index table. The upside is that there's no hit
at read/query time (we don't yet attempt to join from the index table back
to the data table - if a query is using columns that aren't in the index,
it simply won't be used). More here:
https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing
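
To make the trade-off above concrete, here is a minimal Python sketch of a global covered index, modeled with plain dictionaries rather than HBase tables. The class and method names (`GlobalIndexedTable`, `put`, `query`) are made up for illustration and are not Phoenix APIs, and the sketch assumes that each write replaces the whole row. It shows the two write-time costs (reading the prior row state, then writing the index entry) and that the index is only used when a query touches indexed or covered columns:

```python
from typing import Dict, List, Tuple


class GlobalIndexedTable:
    """Toy in-memory model (hypothetical): a data table plus one global index table."""

    def __init__(self, indexed_col: str, covered_cols: List[str]):
        self.indexed_col = indexed_col
        self.covered_cols = list(covered_cols)
        self.data: Dict[str, dict] = {}                # row key -> full row
        self.index: Dict[Tuple[str, str], dict] = {}   # (indexed value, row key) -> covered columns
        self.prior_state_reads = 0                     # counts the write-time lookups

    def put(self, row_key: str, row: dict) -> None:
        # Write-time cost 1: look up the prior state of the row so that a
        # stale index entry can be removed.
        self.prior_state_reads += 1
        old = self.data.get(row_key)
        if old is not None:
            self.index.pop((old[self.indexed_col], row_key), None)
        self.data[row_key] = dict(row)
        # Write-time cost 2: write the new entry to the index table.
        covered = {c: row[c] for c in self.covered_cols}
        self.index[(row[self.indexed_col], row_key)] = covered

    def can_use_index(self, selected_cols: List[str]) -> bool:
        # The index is usable only if every selected column is in it.
        return set(selected_cols) <= {self.indexed_col, *self.covered_cols}

    def query(self, value: str, selected_cols: List[str]) -> Tuple[List[dict], bool]:
        """Return (rows, used_index) for rows whose indexed column equals value."""
        if self.can_use_index(selected_cols):
            rows = []
            for (v, rk) in sorted(self.index):
                if v == value:
                    full = {self.indexed_col: v, **self.index[(v, rk)]}
                    rows.append({c: full[c] for c in selected_cols})
            return rows, True
        # No join back to the data table: fall back to scanning the data table.
        rows = [
            {c: r[c] for c in selected_cols}
            for _, r in sorted(self.data.items())
            if r[self.indexed_col] == value
        ]
        return rows, False


if __name__ == "__main__":
    t = GlobalIndexedTable("city", ["name"])
    t.put("u1", {"city": "A", "name": "n1", "age": 30})
    t.put("u1", {"city": "B", "name": "n1", "age": 31})
    print(t.query("B", ["name"]))  # answered from the index
    print(t.query("B", ["age"]))   # "age" is not covered, so the index is skipped
```

In this model the read path never touches the data table when the index applies, which mirrors the "no hit at read/query time" point; the price is the extra read and write inside every `put`.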

Thanks,
James


On Fri, Jan 3, 2014 at 12:46 PM, Henning Blohm <[email protected]> wrote:

When you scan in index order using RLI, there seems to be no alternative
to involving all regions, and essentially this should happen in parallel,
as otherwise you might not get what you wanted. Also, for a single Get, it
seems (as Lars pointed out in
https://issues.apache.org/jira/browse/HBASE-2038) that you have to consult
all regions.

When that parallelism is no problem (small number of servers) it will
actually help single scan performance (regions can provide their share in
parallel).

A high number of concurrent client requests leads to the same number of
requests on every region, and a corresponding multiple of connections to
be maintained by the client.

My assumption is that this will eventually lead to a scalability problem
when, say, having 100 region servers or so in place. I was wondering if
anyone has experience with that.
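
As a back-of-the-envelope illustration of this concern, the following Python sketch compares how many region requests index lookups generate under the two schemes. It is a simplified model, not a measurement: it assumes uniformly distributed data, that an RLI lookup consults every region, and that a global-index lookup is fully covered and lands on a single index region. The function names and the numbers are hypothetical.

```python
def total_region_requests(lookups_per_sec: int, num_regions: int, region_level: bool) -> int:
    """Region requests per second generated by client index lookups (toy model)."""
    # Assumption: RLI fans out to every region; a covered global-index
    # lookup touches exactly one region of the index table.
    regions_per_lookup = num_regions if region_level else 1
    return lookups_per_sec * regions_per_lookup


def per_region_load(lookups_per_sec: int, num_regions: int, region_level: bool) -> float:
    """Average requests per second seen by a single region (toy model)."""
    if region_level:
        # Every region sees every lookup, so adding regions does not help.
        return float(lookups_per_sec)
    # Uniform distribution: the load is spread across the regions.
    return lookups_per_sec / num_regions


if __name__ == "__main__":
    rate = 1000  # hypothetical client lookups per second
    for regions in (10, 100, 1000):
        print(
            regions,
            total_region_requests(rate, regions, region_level=True),
            total_region_requests(rate, regions, region_level=False),
            per_region_load(rate, regions, region_level=True),
            per_region_load(rate, regions, region_level=False),
        )
```

Under these assumptions the per-region load with RLI stays constant as the cluster grows, while with a global index it shrinks in proportion to the number of regions.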

That will be perfectly acceptable for many use cases that benefit from the
scan (and hence query) performance more than they suffer from the load
problem. Other use cases have fewer requirements on scans and query
flexibility but rather want to preserve the quality that a Get has fixed
resource usage.

Btw.: I was convinced that Phoenix is keeping indexes on the region level.
Is that not so?

Thanks,
Henning


On 03.01.2014 17:57, Anoop John wrote:

With a normal HBase scan, as we know, regions are scanned sequentially.
Phoenix has parallel scan implementations in it. When RLI is used and the
index is applied entirely on the server side, it is independent of how the
client scans. Sequential or parallel, using Java or any other client
layer, using a SQL layer like Phoenix, using MR or not... the client side
doesn't have to worry about this; the index usage happens fully at the
server end.

Yes, when a parallel scan is done on regions, RLI might perform much better.

-Anoop-

On Fri, Jan 3, 2014 at 7:35 PM, rajeshbabu chintaguntla <[email protected]> wrote:

No, the regions are scanned sequentially.

________________________________________
From: Asaf Mesika [[email protected]]
Sent: Friday, January 03, 2014 7:26 PM
To: [email protected]
Subject: Re: secondary index feature

Are the regions scanned in parallel?

On Friday, January 3, 2014, rajeshbabu chintaguntla wrote:

Here are some performance numbers with RLI.

Number of region servers : 4
Data per region          : 2 GB

Regions/RS | Total regions | Block size (KB) | No. of rows matching values | Time taken (sec)
50         | 200           | 64              | 199                         | 102
50         | 200           | 8               | 199                         | 35
100        | 400           | 8               | 350                         | 95
200        | 800           | 8               | 353                         | 153

Without a secondary index, the scan takes hours.


Thanks,
Rajeshbabu
________________________________________
From: Anoop John [[email protected]]
Sent: Friday, January 03, 2014 3:22 PM
To: [email protected]
Subject: Re: secondary index feature

Is there any data on how RLI (or in particular Phoenix) query throughput
correlates with the number of region servers assuming homogeneously
distributed data?

Phoenix has yet to add RLI. For now it has global indexing only. Correct,
James?

The RLI implementation from Huawei (HIndex) has some numbers with respect
to regions, but I doubt whether they cover a large number of RSs. Do you
have some data, Rajesh Babu?

-Anoop-

On Fri, Jan 3, 2014 at 3:11 PM, Henning Blohm <[email protected]> wrote:

Jesse, James, Lars,

after looking around a bit and in particular looking into Phoenix (which I
find very interesting), assuming that you want secondary indexing on
HBASE without adding other infrastructure, there seems to be not a lot of
choice really: Either go with a region-level (and co-processor based)
indexing feature (Phoenix, Huawei, is IHBase dead?) or add an index table
to store (index value, entity key) pairs.

The main concern I have with region-level indexing (RLI) is that Gets
potentially need to visit all regions. Compared to global index tables
this seems to flatten the read-scalability curve of the cluster. In our
case, we have a large data set (hence HBASE) that will be queried (mostly
point-gets via an index) in some linear correlation with its size.

Is there any data on how RLI (or in particular Phoenix) query throughput
correlates with the number of region servers assuming homogeneously
distributed data?

Thanks,
Henning




On 24.12.2013 12:18, Henning Blohm wrote:

All that sounds very promising. I will give it a try and let you know
how things worked out.

Thanks,
Henning

On 12/23/2013 08:10 PM, Jesse Yates wrote:

The work that James is referencing grew out of the discussions Lars and I
had (which led to those blog posts). The solution we implemented is
designed to be generic, as James mentioned above, but was written with all
the hooks necessary for Phoenix to do some really fast updates (or skipping
updates in the case where there is no change).

You should be able to plug in your own simple index builder (there is an
example in the phoenix codebase
<https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example>)
to get a basic solution which supports the same transactional guarantees as
HBase (per row) + data guarantees across the index rows. There are more
details in the presentations James linked.

I'd love to see if your implementation can fit into the framework we wrote
- we would be happy to work to see if it needs some more hooks or
modifications - I have a feeling this is pretty much what you guys will
need

-Jesse


On Mon, Dec 23, 2013 at 10:01 AM, James Taylor <[email protected]> wrote:

Henning,

Jesse Yates wrote the back-end of our global secondary indexing system in
Phoenix. He designed it as a separate, pluggable module with no Phoenix
dependencies. Here's an overview of the feature:
https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing. The
section that discusses the data guarantees and failure management might be
of interest to you:
https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management

This presentation also gives a good overview of the pluggability of his

--
Henning Blohm

*ZFabrik Software KG*

T:      +49 6227 3984255
F:      +49 6227 3984254
M:      +49 1781891820

Lammstrasse 2 69190 Walldorf

[email protected] <mailto:[email protected]>
Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
ZFabrik <http://www.zfabrik.de>
Blog <http://www.z2-environment.net/blog>
Z2-Environment <http://www.z2-environment.eu>
Z2 Wiki <http://redmine.z2-environment.net>


