Thanks James! I have some Phoenix-specific questions. I suppose the Phoenix group is a better place to discuss those, though.

Henning

On 03.01.2014 22:34, James Taylor wrote:
No worries, Henning. It's a little deceiving, because the coprocessors that
do the index maintenance are invoked on a per region basis. However, the
writes/puts that they do for the maintenance end up going over the wire if
necessary.
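
Just to make that concrete, below is a bare-bones sketch of such a per-region hook, written against the HBase 0.94-era RegionObserver API. It is for illustration only - it is not the Phoenix implementation, and the table and column names are invented:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Attached to every region of the data table. The hook itself runs locally on
// the region server hosting the region, but the put it issues against the
// index table is a normal client-style write and crosses the wire whenever
// the target index region lives on another server.
public class IndexMaintenanceObserver extends BaseRegionObserver {

    private static final byte[] CF = Bytes.toBytes("f");            // invented family
    private static final byte[] COL = Bytes.toBytes("email");       // invented indexed column
    private static final byte[] INDEX_TABLE = Bytes.toBytes("DATA_IDX");

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                        Put put, WALEdit edit, boolean writeToWAL) throws IOException {
        if (!put.has(CF, COL)) {
            return; // nothing indexed in this mutation
        }
        List<KeyValue> kvs = put.get(CF, COL);
        byte[] value = kvs.get(0).getValue();

        // Index row key = indexed value + data row key.
        Put indexPut = new Put(Bytes.add(value, put.getRow()));
        indexPut.add(CF, Bytes.toBytes("_"), put.getRow());

        HTableInterface index = ctx.getEnvironment().getTable(INDEX_TABLE);
        try {
            index.put(indexPut);
        } finally {
            index.close();
        }
    }
}

So the hook fires per region, but the index write itself is routed like any other client write.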

Let me know if you have other questions. It'd be good to understand your
use case more to see if Phoenix is a good fit - we're definitely open to
collaborating. FYI, we're in the process of moving to Apache, so will keep
you posted once the transition is complete.

Thanks,

James


On Fri, Jan 3, 2014 at 1:11 PM, Henning Blohm <[email protected]> wrote:

Hi James,

this is a little embarrassing... I even browsed through the code and read it as implementing a region-level index.

But now at least I get the restrictions mentioned for using the covered
indexes.

Thanks for clarifying. Guess I need to browse the code a little harder ;-)

Henning


On 03.01.2014 21:53, James Taylor wrote:

Hi Henning,
Phoenix maintains a global index. It is essentially maintaining another HBase table for you with a different row key (and a subset of your data table columns that are "covered"). When an index is used by Phoenix, it is *exactly* like querying a data table (that's what Phoenix does - it ends up issuing a Phoenix query against a Phoenix table that happens to be an index table).

The hit you take for a global index is at write time - we need to look up the prior state of the rows being updated to do the index maintenance. Then we need to do a write to the index table. The upside is that there's no hit at read/query time (we don't yet attempt to join from the index table back to the data table - if a query is using columns that aren't in the index, it simply won't be used). More here:
https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing
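
If it helps to picture the write path, here is a minimal client-side sketch of what that maintenance boils down to. This is not what Phoenix actually executes (Phoenix does it inside coprocessors, with the guarantees described on the wiki page above), and the table and column names are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GlobalIndexWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable data = new HTable(conf, "DATA");       // invented data table
        HTable index = new HTable(conf, "DATA_IDX");  // invented index table

        byte[] cf = Bytes.toBytes("f");
        byte[] col = Bytes.toBytes("email");          // the indexed column
        byte[] rowKey = Bytes.toBytes("user-42");
        byte[] newValue = Bytes.toBytes("alice@example.org");

        // 1. Look up the prior state of the row to find the old indexed value.
        Result prior = data.get(new Get(rowKey));
        byte[] oldValue = prior.getValue(cf, col);

        // 2. Remove the stale index row (index row key = indexed value + data row key).
        if (oldValue != null) {
            index.delete(new Delete(Bytes.add(oldValue, rowKey)));
        }

        // 3. Write the new index row (covered columns could be copied in here, too)
        //    and finally the data row itself.
        Put indexPut = new Put(Bytes.add(newValue, rowKey));
        indexPut.add(cf, Bytes.toBytes("_"), rowKey);
        index.put(indexPut);

        Put dataPut = new Put(rowKey);
        dataPut.add(cf, col, newValue);
        data.put(dataPut);

        data.close();
        index.close();
    }
}

The extra Get and the extra write to the second table are exactly the write-time hit described above.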

Thanks,
James


On Fri, Jan 3, 2014 at 12:46 PM, Henning Blohm <[email protected]>
wrote:

When scanning in order of an index and you use RLI, it seems there is no alternative but to involve all regions - and essentially this should happen in parallel, as otherwise you might not get what you wanted. Also, for a single Get, it seems (as Lars pointed out in https://issues.apache.org/jira/browse/HBASE-2038) that you have to consult all regions.

When that parallelism is no problem (a small number of servers) it will actually help single-scan performance (regions can provide their share in parallel).
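
As a rough illustration of that fan-out from the client side (a region-level index would do the per-region work on the server instead), something like this - plain HBase 0.94 client API, invented table name - issues one scan per region in parallel and merges the results:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Pair;

public class ParallelRegionScanSketch {
    public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        HTable probe = new HTable(conf, "DATA");                 // invented table name
        Pair<byte[][], byte[][]> keys = probe.getStartEndKeys(); // one entry per region
        probe.close();

        int regions = keys.getFirst().length;
        ExecutorService pool = Executors.newFixedThreadPool(Math.min(regions, 16));
        List<Future<Integer>> futures = new ArrayList<Future<Integer>>();

        for (int i = 0; i < regions; i++) {
            final byte[] start = keys.getFirst()[i];
            final byte[] stop = keys.getSecond()[i];
            futures.add(pool.submit(new Callable<Integer>() {
                public Integer call() throws Exception {
                    // Each task scans exactly one region's key range.
                    HTable table = new HTable(conf, "DATA");
                    try {
                        int hits = 0;
                        ResultScanner scanner = table.getScanner(new Scan(start, stop));
                        for (Result r : scanner) {
                            hits++; // evaluate/filter the row here
                        }
                        scanner.close();
                        return hits;
                    } finally {
                        table.close();
                    }
                }
            }));
        }

        int total = 0;
        for (Future<Integer> f : futures) {
            total += f.get();
        }
        pool.shutdown();
        System.out.println("matching rows: " + total);
    }
}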

A high number of concurrent client requests leads to the same number of requests on all regions, and a corresponding multiple of connections to be maintained by the client.

My assumption is that this will eventually lead to a scalability problem - when, say, having 100 region servers or so in place. I was wondering if anyone has experience with that.

That will be perfectly acceptable for many use cases that benefit from the scan (and hence query) performance more than they suffer from the load problem. Other use cases have fewer requirements on scans and query flexibility but rather want to preserve the property that a Get has fixed resource usage.

Btw.: I was convinced that Phoenix is keeping indexes on the region level. Is that not so?

Thanks,
Henning


On 03.01.2014 17:57, Anoop John wrote:

In the case of a normal HBase scan, as we know, regions will be scanned sequentially. Phoenix has parallel scan implementations in it. When RLI is used and we make use of the index completely at the server side, it is irrespective of the client's way of scanning: sequential or parallel, using Java or any other client layer, using an SQL layer like Phoenix, using MR or not... the client side doesn't have to worry about this; the index usage will be entirely at the server end.

Yes, when a parallel scan is done on regions, RLI might perform much better.

-Anoop-

On Fri, Jan 3, 2014 at 7:35 PM, rajeshbabu chintaguntla <
[email protected]> wrote:

No, the regions are scanned sequentially.

________________________________________
From: Asaf Mesika [[email protected]]
Sent: Friday, January 03, 2014 7:26 PM
To: [email protected]
    Subject: Re: secondary index feature

Are the regions scanned in parallel?

On Friday, January 3, 2014, rajeshbabu chintaguntla wrote:

Here are some performance numbers with RLI.

No. of region servers: 4
Data per region: 2 GB

Regions/RS | Total regions | Block size (KB) | No. of rows matching values | Time taken (sec)
50         | 200           | 64              | 199                         | 102
50         | 200           | 8               | 199                         | 35
100        | 400           | 8               | 350                         | 95
200        | 800           | 8               | 353                         | 153

Without the secondary index, the scan takes hours.


Thanks,
Rajeshbabu
________________________________________
From: Anoop John [[email protected]]
Sent: Friday, January 03, 2014 3:22 PM
To: [email protected]
Subject: Re: secondary index feature

   Is there any data on how RLI (or in particular Phoenix) query throughput correlates with the number of region servers assuming homogeneously distributed data?

Phoenix is yet to add RLI. Right now it has global indexing only. Correct, James?

The RLI implementation from Huawei (HIndex) has some numbers with respect to regions, but I doubt whether they cover a large number of region servers. Do you have some data, Rajesh Babu?

-Anoop-

On Fri, Jan 3, 2014 at 3:11 PM, Henning Blohm <[email protected]> wrote:
Jesse, James, Lars,

after looking around a bit and in particular looking into Phoenix (which I find very interesting), assuming that you want secondary indexing on HBase without adding other infrastructure, there seems to be not a lot of choice really: either go with a region-level (and co-processor based) indexing feature (Phoenix, Huawei, is IHBase dead?) or add an index table to store (index value, entity key) pairs.
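
For the second option, reading through such an index table is conceptually just a prefix scan on the index followed by point gets on the data table. A minimal sketch with the plain HBase client API (table names and the key layout are made up; a real index key would need a separator or length prefix so the two parts can be split unambiguously):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexTableLookupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable index = new HTable(conf, "DATA_IDX"); // rows keyed by indexValue + dataRowKey
        HTable data = new HTable(conf, "DATA");

        byte[] indexedValue = Bytes.toBytes("alice@example.org");

        // 1. Scan the index table for all entries whose key starts with the value.
        Scan scan = new Scan(indexedValue);
        scan.setFilter(new PrefixFilter(indexedValue));
        ResultScanner entries = index.getScanner(scan);

        // 2. Recover the data row key from the index row key and do a point get.
        for (Result entry : entries) {
            byte[] indexRow = entry.getRow();
            byte[] dataRowKey = Bytes.tail(indexRow, indexRow.length - indexedValue.length);
            Result row = data.get(new Get(dataRowKey));
            System.out.println(Bytes.toStringBinary(row.getRow()));
        }

        entries.close();
        index.close();
        data.close();
    }
}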

The main concern I have with region-level indexing (RLI) is that Gets potentially require visiting all regions. Compared to global index tables, this seems to flatten the read-scalability curve of the cluster. In our case, we have a large data set (hence HBase) that will be queried (mostly point-gets via an index) in some linear correlation with its size.

Is there any data on how RLI (or in particular Phoenix) query throughput correlates with the number of region servers, assuming homogeneously distributed data?

Thanks,
Henning




On 24.12.2013 12:18, Henning Blohm wrote:

All that sounds very promising. I will give it a try and let you know how things worked out.

Thanks,
Henning

On 12/23/2013 08:10 PM, Jesse Yates wrote:

The work that James is referencing grew out of the discussions Lars and I had (which led to those blog posts). The solution we implemented is designed to be generic, as James mentioned above, but was written with all the hooks necessary for Phoenix to do some really fast updates (or to skip updates in the case where there is no change).

You should be able to plug in your own simple index builder (there is an example in the phoenix codebase <https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example>) into the basic solution, which supports the same transactional guarantees as HBase (per row) plus data guarantees across the index rows. There are more details in the presentations James linked.
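
Just to sketch the shape of such a pluggable builder - the names below are hypothetical, not the actual interface from the package linked above - it essentially boils down to mapping a data-table mutation plus the prior row state to a set of index-table mutations:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical plug-in point: given a data-table put and the prior row state,
// produce the index-table mutations the framework should apply.
interface SimpleIndexBuilder {
    List<Mutation> getIndexUpdates(Put dataPut, Result priorState);
}

class EmailIndexBuilder implements SimpleIndexBuilder {
    private static final byte[] CF = Bytes.toBytes("f");
    private static final byte[] COL = Bytes.toBytes("email");

    public List<Mutation> getIndexUpdates(Put dataPut, Result priorState) {
        List<Mutation> updates = new ArrayList<Mutation>();

        // Clean up the stale index entry for the previous value, if any.
        byte[] oldValue = priorState == null ? null : priorState.getValue(CF, COL);
        if (oldValue != null) {
            updates.add(new Delete(Bytes.add(oldValue, dataPut.getRow())));
        }

        // Add the index entry for the new value: index row key = value + data row key.
        if (dataPut.has(CF, COL)) {
            byte[] newValue = dataPut.get(CF, COL).get(0).getValue();
            Put indexPut = new Put(Bytes.add(newValue, dataPut.getRow()));
            indexPut.add(CF, Bytes.toBytes("_"), dataPut.getRow());
            updates.add(indexPut);
        }
        return updates;
    }
}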

I'd love to see if your implementation can fit into the framework we wrote - we would be happy to work with you to see if it needs some more hooks or modifications. I have a feeling this is pretty much what you guys will need.

-Jesse

On Mon, Dec 23, 2013 at 10:01 AM, James Taylor <[email protected]> wrote:
Henning,
Jesse Yates wrote the back-end of our global secondary indexing system in Phoenix. He designed it as a separate, pluggable module with no Phoenix dependencies. Here's an overview of the feature:
https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing. The section that discusses the data guarantees and failure management might be of interest to you:
https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management
This presentation also gives a good overview of the pluggability of his
--
Henning Blohm

*ZFabrik Software KG*

T:      +49 6227 3984255
F:      +49 6227 3984254
M:      +49 1781891820

Lammstrasse 2 69190 Walldorf

[email protected] <mailto:[email protected]>
Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
ZFabrik <http://www.zfabrik.de>
Blog <http://www.z2-environment.net/blog>
Z2-Environment <http://www.z2-environment.eu>
Z2 Wiki <http://redmine.z2-environment.net>
