Thanks James! I have some Phoenix-specific questions. I suppose the Phoenix group is a better place to discuss those, though.

Henning

On 03.01.2014 22:34, James Taylor wrote:
No worries, Henning. It's a little deceiving, because the coprocessors that
do the index maintenance are invoked on a per region basis. However, the
writes/puts that they do for the maintenance end up going over the wire if
necessary.
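
Just to make that concrete, below is a bare-bones sketch of such a per-region hook, written against the HBase 0.94-era RegionObserver API. It is for illustration only - it is not the Phoenix implementation, and the table and column names are invented:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Attached to every region of the data table. The hook itself runs locally on
// the region server hosting the region, but the put it issues against the
// index table is a normal client-style write and crosses the wire whenever
// the target index region lives on another server.
public class IndexMaintenanceObserver extends BaseRegionObserver {

    private static final byte[] CF = Bytes.toBytes("f");            // invented family
    private static final byte[] COL = Bytes.toBytes("email");       // invented indexed column
    private static final byte[] INDEX_TABLE = Bytes.toBytes("DATA_IDX");

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                        Put put, WALEdit edit, boolean writeToWAL) throws IOException {
        if (!put.has(CF, COL)) {
            return; // nothing indexed in this mutation
        }
        List<KeyValue> kvs = put.get(CF, COL);
        byte[] value = kvs.get(0).getValue();

        // Index row key = indexed value + data row key.
        Put indexPut = new Put(Bytes.add(value, put.getRow()));
        indexPut.add(CF, Bytes.toBytes("_"), put.getRow());

        HTableInterface index = ctx.getEnvironment().getTable(INDEX_TABLE);
        try {
            index.put(indexPut);
        } finally {
            index.close();
        }
    }
}

So the hook fires per region, but the index write itself is routed like any other client write.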

Let me know if you have other questions. It'd be good to understand your
use case more to see if Phoenix is a good fit - we're definitely open to
collaborating. FYI, we're in the process of moving to Apache, so will keep
you posted once the transition is complete.

Thanks,

James


On Fri, Jan 3, 2014 at 1:11 PM, Henning Blohm <[email protected]> wrote:

Hi James,

this is a little embarrassing... I even browsed through the code and read it as implementing a region-level index.

But now at least I get the restrictions mentioned for using the covered
indexes.

Thanks for clarifying. Guess I need to browse the code a little harder ;-)

Henning


On 03.01.2014 21:53, James Taylor wrote:

Hi Henning,
Phoenix maintains a global index. It is essentially maintaining another HBase table for you with a different row key (and a subset of your data table columns that are "covered"). When an index is used by Phoenix, it is *exactly* like querying a data table (that's what Phoenix does - it ends up issuing a Phoenix query against a Phoenix table that happens to be an index table).

The hit you take for a global index is at write time - we need to look up the prior state of the rows being updated to do the index maintenance. Then we need to do a write to the index table. The upside is that there's no hit at read/query time (we don't yet attempt to join from the index table back to the data table - if a query is using columns that aren't in the index, it simply won't be used). More here:
https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing
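
If it helps to picture the write path, here is a minimal client-side sketch of what that maintenance boils down to. This is not what Phoenix actually executes (Phoenix does it inside coprocessors, with the guarantees described on the wiki page above), and the table and column names are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GlobalIndexWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable data = new HTable(conf, "DATA");       // invented data table
        HTable index = new HTable(conf, "DATA_IDX");  // invented index table

        byte[] cf = Bytes.toBytes("f");
        byte[] col = Bytes.toBytes("email");          // the indexed column
        byte[] rowKey = Bytes.toBytes("user-42");
        byte[] newValue = Bytes.toBytes("alice@example.org");

        // 1. Look up the prior state of the row to find the old indexed value.
        Result prior = data.get(new Get(rowKey));
        byte[] oldValue = prior.getValue(cf, col);

        // 2. Remove the stale index row (index row key = indexed value + data row key).
        if (oldValue != null) {
            index.delete(new Delete(Bytes.add(oldValue, rowKey)));
        }

        // 3. Write the new index row (covered columns could be copied in here, too)
        //    and finally the data row itself.
        Put indexPut = new Put(Bytes.add(newValue, rowKey));
        indexPut.add(cf, Bytes.toBytes("_"), rowKey);
        index.put(indexPut);

        Put dataPut = new Put(rowKey);
        dataPut.add(cf, col, newValue);
        data.put(dataPut);

        data.close();
        index.close();
    }
}

The extra Get and the extra write to the second table are exactly the write-time hit described above.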

Thanks,
James


On Fri, Jan 3, 2014 at 12:46 PM, Henning Blohm <[email protected]>
wrote:

When scanning in order of an index and you use RLI, it seems there is no alternative but to involve all regions - and essentially this should happen in parallel, as otherwise you might not get what you wanted. Also, for a single Get, it seems (as Lars pointed out in https://issues.apache.org/jira/browse/HBASE-2038) that you have to consult all regions.

When that parallelism is no problem (a small number of servers) it will actually help single-scan performance (regions can provide their share in parallel).
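
As a rough illustration of that fan-out from the client side (a region-level index would do the per-region work on the server instead), something like this - plain HBase 0.94 client API, invented table name - issues one scan per region in parallel and merges the results:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Pair;

public class ParallelRegionScanSketch {
    public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        HTable probe = new HTable(conf, "DATA");                 // invented table name
        Pair<byte[][], byte[][]> keys = probe.getStartEndKeys(); // one entry per region
        probe.close();

        int regions = keys.getFirst().length;
        ExecutorService pool = Executors.newFixedThreadPool(Math.min(regions, 16));
        List<Future<Integer>> futures = new ArrayList<Future<Integer>>();

        for (int i = 0; i < regions; i++) {
            final byte[] start = keys.getFirst()[i];
            final byte[] stop = keys.getSecond()[i];
            futures.add(pool.submit(new Callable<Integer>() {
                public Integer call() throws Exception {
                    // Each task scans exactly one region's key range.
                    HTable table = new HTable(conf, "DATA");
                    try {
                        int hits = 0;
                        ResultScanner scanner = table.getScanner(new Scan(start, stop));
                        for (Result r : scanner) {
                            hits++; // evaluate/filter the row here
                        }
                        scanner.close();
                        return hits;
                    } finally {
                        table.close();
                    }
                }
            }));
        }

        int total = 0;
        for (Future<Integer> f : futures) {
            total += f.get();
        }
        pool.shutdown();
        System.out.println("matching rows: " + total);
    }
}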

A high number of concurrent client requests leads to the same number of requests on all regions, and a corresponding multiple of connections to be maintained by the client.

My assumption is that this will eventually lead to a scalability problem - when, say, having 100 region servers or so in place. I was wondering if anyone has experience with that.

That will be perfectly acceptable for many use cases that benefit from the scan (and hence query) performance more than they suffer from the load problem. Other use cases have fewer requirements on scans and query flexibility but rather want to preserve the property that a Get has fixed resource usage.

Btw.: I was convinced that Phoenix is keeping indexes on the region level. Is that not so?

Thanks,
Henning


On 03.01.2014 17:57, Anoop John wrote:

In the case of a normal HBase scan, as we know, regions will be scanned sequentially. Phoenix has parallel scan implementations in it. When RLI is used and we make use of the index completely at the server side, it is irrespective of the client's way of scanning: sequential or parallel, using Java or any other client layer, using an SQL layer like Phoenix, using MR or not... the client side doesn't have to worry about this; the index usage will be entirely at the server end.

Yes, when a parallel scan is done on regions, RLI might perform much better.

-Anoop-

On Fri, Jan 3, 2014 at 7:35 PM, rajeshbabu chintaguntla <
[email protected]> wrote:

No, the regions are scanned sequentially.

________________________________________
From: Asaf Mesika [[email protected]]
Sent: Friday, January 03, 2014 7:26 PM
To: [email protected]
    Subject: Re: secondary index feature

Are the regions scanned in parallel?

On Friday, January 3, 2014, rajeshbabu chintaguntla wrote:

Here are some performance numbers with RLI.

No. of region servers: 4
Data per region: 2 GB

Regions/RS | Total regions | Block size (KB) | No. of rows matching values | Time taken (sec)
50         | 200           | 64              | 199                         | 102
50         | 200           | 8               | 199                         | 35
100        | 400           | 8               | 350                         | 95
200        | 800           | 8               | 353                         | 153

Without the secondary index, the scan takes hours.


Thanks,
Rajeshbabu
________________________________________
From: Anoop John [[email protected]]
Sent: Friday, January 03, 2014 3:22 PM
To: [email protected]
Subject: Re: secondary index feature

   Is there any data on how RLI (or in particular Phoenix) query throughput correlates with the number of region servers assuming homogeneously distributed data?

Phoenix is yet to add RLI. Right now it has global indexing only. Correct, James?

The RLI implementation from Huawei (HIndex) has some numbers with respect to regions, but I doubt whether they cover a large number of region servers. Do you have some data, Rajesh Babu?

-Anoop-

On Fri, Jan 3, 2014 at 3:11 PM, Henning Blohm <[email protected]> wrote:
Jesse, James, Lars,

after looking around a bit and in particular looking into Phoenix (which I find very interesting), assuming that you want secondary indexing on HBase without adding other infrastructure, there seems to be not a lot of choice really: either go with a region-level (and co-processor based) indexing feature (Phoenix, Huawei, is IHBase dead?) or add an index table to store (index value, entity key) pairs.
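
For the second option, reading through such an index table is conceptually just a prefix scan on the index followed by point gets on the data table. A minimal sketch with the plain HBase client API (table names and the key layout are made up; a real index key would need a separator or length prefix so the two parts can be split unambiguously):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexTableLookupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable index = new HTable(conf, "DATA_IDX"); // rows keyed by indexValue + dataRowKey
        HTable data = new HTable(conf, "DATA");

        byte[] indexedValue = Bytes.toBytes("alice@example.org");

        // 1. Scan the index table for all entries whose key starts with the value.
        Scan scan = new Scan(indexedValue);
        scan.setFilter(new PrefixFilter(indexedValue));
        ResultScanner entries = index.getScanner(scan);

        // 2. Recover the data row key from the index row key and do a point get.
        for (Result entry : entries) {
            byte[] indexRow = entry.getRow();
            byte[] dataRowKey = Bytes.tail(indexRow, indexRow.length - indexedValue.length);
            Result row = data.get(new Get(dataRowKey));
            System.out.println(Bytes.toStringBinary(row.getRow()));
        }

        entries.close();
        index.close();
        data.close();
    }
}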

The main concern I have with region-level indexing (RLI) is that Gets potentially require visiting all regions. Compared to global index tables, this seems to flatten the read-scalability curve of the cluster. In our case, we have a large data set (hence HBase) that will be queried (mostly point-gets via an index) in some linear correlation with its size.

Is there any data on how RLI (or in particular Phoenix) query throughput correlates with the number of region servers, assuming homogeneously distributed data?

Thanks,
Henning




On 24.12.2013 12:18, Henning Blohm wrote:

All that sounds very promising. I will give it a try and let you know how things worked out.

Thanks,
Henning

On 12/23/2013 08:10 PM, Jesse Yates wrote:

The work that James is referencing grew out of the discussions Lars and I had (which led to those blog posts). The solution we implemented is designed to be generic, as James mentioned above, but was written with all the hooks necessary for Phoenix to do some really fast updates (or to skip updates in the case where there is no change).

You should be able to plug in your own simple index builder (there is an example in the phoenix codebase <https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example>) into the basic solution, which supports the same transactional guarantees as HBase (per row) plus data guarantees across the index rows. There are more details in the presentations James linked.
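
Just to sketch the shape of such a pluggable builder - the names below are hypothetical, not the actual interface from the package linked above - it essentially boils down to mapping a data-table mutation plus the prior row state to a set of index-table mutations:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical plug-in point: given a data-table put and the prior row state,
// produce the index-table mutations the framework should apply.
interface SimpleIndexBuilder {
    List<Mutation> getIndexUpdates(Put dataPut, Result priorState);
}

class EmailIndexBuilder implements SimpleIndexBuilder {
    private static final byte[] CF = Bytes.toBytes("f");
    private static final byte[] COL = Bytes.toBytes("email");

    public List<Mutation> getIndexUpdates(Put dataPut, Result priorState) {
        List<Mutation> updates = new ArrayList<Mutation>();

        // Clean up the stale index entry for the previous value, if any.
        byte[] oldValue = priorState == null ? null : priorState.getValue(CF, COL);
        if (oldValue != null) {
            updates.add(new Delete(Bytes.add(oldValue, dataPut.getRow())));
        }

        // Add the index entry for the new value: index row key = value + data row key.
        if (dataPut.has(CF, COL)) {
            byte[] newValue = dataPut.get(CF, COL).get(0).getValue();
            Put indexPut = new Put(Bytes.add(newValue, dataPut.getRow()));
            indexPut.add(CF, Bytes.toBytes("_"), dataPut.getRow());
            updates.add(indexPut);
        }
        return updates;
    }
}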

I'd love to see if your implementation can fit into the framework we wrote - we would be happy to work with you to see if it needs some more hooks or modifications. I have a feeling this is pretty much what you guys will need.

-Jesse

On Mon, Dec 23, 2013 at 10:01 AM, James Taylor <[email protected]> wrote:
Henning,
Jesse Yates wrote the back-end of our global secondary indexing system in Phoenix. He designed it as a separate, pluggable module with no Phoenix dependencies. Here's an overview of the feature:
https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing. The section that discusses the data guarantees and failure management might be of interest to you:
https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management
This presentation also gives a good overview of the pluggability of his
--
Henning Blohm

*ZFabrik Software KG*

T:      +49 6227 3984255
F:      +49 6227 3984254
M:      +49 1781891820

Lammstrasse 2 69190 Walldorf

[email protected] <mailto:[email protected]>
Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
ZFabrik <http://www.zfabrik.de>
Blog <http://www.z2-environment.net/blog>
Z2-Environment <http://www.z2-environment.eu>
Z2 Wiki <http://redmine.z2-environment.net>
