Phoenix and Cloudera

2017-12-27 Thread Dor Ben Dov
Hi,

I am trying with no luck due to many exceptions to connect phoenix to CDH 5.10.1
Can one give some tips, guidance how it should be done ?

I tried with the manual on the Apache site as well as via Cloudera Parcels 
distribution and activation process.

Regards,
Dor Ben Dov
Research Manager, Digital - BSS - Research
+972-9-7764043   (office)
+972-52-3573492 (mobile)
[cid:image001.png@01D28AC9.B559B360]

Follow us on Facebook, 
Twitter, 
LinkedIn, 
YouTube, 
Google+ and the Amdocs blog 
network| Digital BSS 
Research@Yammer

This message and the information contained herein is proprietary and 
confidential and subject to the Amdocs policy statement,

you may review at https://www.amdocs.com/about/email-disclaimer 



Re: Query optimization

2017-12-27 Thread James Taylor
Looks like the second query is sorting the entire PEOPLE table (though it
seems like that shouldn’t be necessary as it’s probably already sorted by
PEOPLE_ID) while the first one is sorting only part of MY_TABLE (which is
likely less data). Might be a bug as the queries look the same.

Please log a JIRA and thanks for all the details.

On Wed, Dec 27, 2017 at 3:44 PM Flavio Pompermaier 
wrote:

> Ok.  So why the 2nd query requires more memory than the first one
> (nonetheless USE_SORT_MERGE_JOIN is used) and can't complete?
>
>
> On 28 Dec 2017 00:33, "James Taylor"  wrote:
>
> A hash join (the default) will be faster but the tables being cached (last
> or RHS table being joined) must be small enough to fit into memory on the
> region server. If it's too big, you can use the USE_SORT_MERGE_JOIN which
> would not have this restriction.
>
> On Wed, Dec 27, 2017 at 3:16 PM, Flavio Pompermaier 
> wrote:
>
>> Just to summarize things...is the best approach, in terms of required
>> memory, for Apache Phoenix queries to use sort merge join? Should inner
>> queries be avoided?
>>
>>
>> On 22 Dec 2017 22:47, "Flavio Pompermaier"  wrote:
>>
>> MYTABLE is definitely much bigger than PEOPLE table, in terms of
>> cardinality. In terms of cells (rows x columns) PEOPLE is probably bigger
>>
>> On 22 Dec 2017 22:36, "Ethan"  wrote:
>>
>>> I see. I think client side probably hold on to the iterators from the
>>> both sides and crawling forward to do the merge sort. in this case should
>>> be no much memory footprint either way where the filter is performed.
>>>
>>> On December 22, 2017 at 1:04:18 PM, James Taylor (jamestay...@apache.org)
>>> wrote:
>>>
>>> There’s no shipping of any tables with a sort merge join.
>>>
>>> On Fri, Dec 22, 2017 at 1:02 PM Ethan Wang  wrote:
>>>
 I see. Looks like it's possible the rhs (MYTABLE) is too big to ship
 around without get filtered first. Just for experiment, if you took out
 hint USE_SORT_MERGE_JOIN, what will be the plan?


 On December 22, 2017 at 12:46:25 PM, James Taylor (
 jamestay...@apache.org) wrote:

 For sort merge join, both post-filtered table results are sorted on the
 server side and then a merge sort is done on the client-side.

 On Fri, Dec 22, 2017 at 12:44 PM, Ethan  wrote:

> Hello Flavio,
>
> From the plan looks like to me the second query is doing the filter at
> parent table (PEOPLE). So what is the size of your PEOPLE and MYTABLE
> (after filtered) respectively?
>
> For sort merge join, anyone knows are the both sides get shipped to
> client to do the merge sort?
>
> Thanks,
>
>
> On December 22, 2017 at 9:58:30 AM, Flavio Pompermaier (
> pomperma...@okkam.it) wrote:
>
> Any help here...?
>
> On 20 Dec 2017 17:58, "Flavio Pompermaier" 
> wrote:
>
>> Hi to all,
>> I'm trying to find the best query for my use case but I found that
>> one version work and the other one does not (unless that I don't apply 
>> some
>> tuning to timeouts etc like explained in [1]).
>>
>> The 2 queries extract the same data but, while the first query
>> terminates the second does not.
>> *PS*:  without the USE_SORT_MERGE_JOIN both queries weren't working
>>
>> SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*)
>> FROM PEOPLE ds JOIN MYTABLE l ON ds.PERSON_ID = l.LOCALID
>> WHERE l.EID IS NULL AND l.DSID = 'PEOPLE' AND l.HAS_CANDIDATES =
>> FALSE;
>>
>> +---+-+++
>> | PLAN
>>   | EST_BYTES_READ  |
>> EST_ROWS_READ  |  EST_INFO_TS   |
>>
>> +---+-+++
>> | SORT-MERGE-JOIN (INNER) TABLES
>>   | 14155777900
>> <(415)%20577-7900> | 12077867   | 1513754378759  |
>> | CLIENT 42-CHUNK 6168903 ROWS 1132461 BYTES PARALLEL 3-WAY
>> FULL SCAN OVER PEOPLE | 14155777900
>> <(415)%20577-7900> | 12077867   | 1513754378759  |
>> | SERVER FILTER BY FIRST KEY ONLY
>>| 14155777900
>> <(415)%20577-7900> | 12077867   | 1513754378759  |
>> | CLIENT MERGE SORT
>>| 14155777900
>> <(415)%20577-7900> | 12077867   | 1513754378759  |
>> | AND (SKIP MERGE)
>>   

Re: Query optimization

2017-12-27 Thread Flavio Pompermaier
Ok.  So why the 2nd query requires more memory than the first one
(nonetheless USE_SORT_MERGE_JOIN is used) and can't complete?

On 28 Dec 2017 00:33, "James Taylor"  wrote:

A hash join (the default) will be faster but the tables being cached (last
or RHS table being joined) must be small enough to fit into memory on the
region server. If it's too big, you can use the USE_SORT_MERGE_JOIN which
would not have this restriction.

On Wed, Dec 27, 2017 at 3:16 PM, Flavio Pompermaier 
wrote:

> Just to summarize things...is the best approach, in terms of required
> memory, for Apache Phoenix queries to use sort merge join? Should inner
> queries be avoided?
>
>
> On 22 Dec 2017 22:47, "Flavio Pompermaier"  wrote:
>
> MYTABLE is definitely much bigger than PEOPLE table, in terms of
> cardinality. In terms of cells (rows x columns) PEOPLE is probably bigger
>
> On 22 Dec 2017 22:36, "Ethan"  wrote:
>
>> I see. I think client side probably hold on to the iterators from the
>> both sides and crawling forward to do the merge sort. in this case should
>> be no much memory footprint either way where the filter is performed.
>>
>> On December 22, 2017 at 1:04:18 PM, James Taylor (jamestay...@apache.org)
>> wrote:
>>
>> There’s no shipping of any tables with a sort merge join.
>>
>> On Fri, Dec 22, 2017 at 1:02 PM Ethan Wang  wrote:
>>
>>> I see. Looks like it's possible the rhs (MYTABLE) is too big to ship
>>> around without get filtered first. Just for experiment, if you took out
>>> hint USE_SORT_MERGE_JOIN, what will be the plan?
>>>
>>>
>>> On December 22, 2017 at 12:46:25 PM, James Taylor (
>>> jamestay...@apache.org) wrote:
>>>
>>> For sort merge join, both post-filtered table results are sorted on the
>>> server side and then a merge sort is done on the client-side.
>>>
>>> On Fri, Dec 22, 2017 at 12:44 PM, Ethan  wrote:
>>>
 Hello Flavio,

 From the plan looks like to me the second query is doing the filter at
 parent table (PEOPLE). So what is the size of your PEOPLE and MYTABLE
 (after filtered) respectively?

 For sort merge join, anyone knows are the both sides get shipped to
 client to do the merge sort?

 Thanks,


 On December 22, 2017 at 9:58:30 AM, Flavio Pompermaier (
 pomperma...@okkam.it) wrote:

 Any help here...?

 On 20 Dec 2017 17:58, "Flavio Pompermaier" 
 wrote:

> Hi to all,
> I'm trying to find the best query for my use case but I found that one
> version work and the other one does not (unless that I don't apply some
> tuning to timeouts etc like explained in [1]).
>
> The 2 queries extract the same data but, while the first query
> terminates the second does not.
> *PS*:  without the USE_SORT_MERGE_JOIN both queries weren't working
>
> SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*)
> FROM PEOPLE ds JOIN MYTABLE l ON ds.PERSON_ID = l.LOCALID
> WHERE l.EID IS NULL AND l.DSID = 'PEOPLE' AND l.HAS_CANDIDATES = FALSE;
> +---
> +---
> --+++
> | PLAN
>   | EST_BYTES_READ  |
> EST_ROWS_READ  |  EST_INFO_TS   |
> +---
> +---
> --+++
> | SORT-MERGE-JOIN (INNER) TABLES
>   | 14155777900
> <(415)%20577-7900> | 12077867   | 1513754378759  |
> | CLIENT 42-CHUNK 6168903 ROWS 1132461 BYTES PARALLEL 3-WAY
> FULL SCAN OVER PEOPLE | 14155777900 <(415)%20577-7900>
>| 12077867   | 1513754378759  |
> | SERVER FILTER BY FIRST KEY ONLY
>  | 14155777900
> <(415)%20577-7900> | 12077867   | 1513754378759  |
> | CLIENT MERGE SORT
>  | 14155777900
> <(415)%20577-7900> | 12077867   | 1513754378759  |
> | AND (SKIP MERGE)
>   | 14155777900
> <(415)%20577-7900> | 12077867   | 1513754378759  |
> | CLIENT 15-CHUNK 5908964 ROWS 2831155679 BYTES PARALLEL 15-WAY
> RANGE SCAN OVER MYTABLE [0] - [2]  | 14155777900 <(415)%20577-7900>
>| 12077867   | 1513754378759  |
> | SERVER FILTER BY (EID IS NULL AND DSID = 'PEOPLE' AND
> HAS_CANDIDATES = false)   | 14155777900
> <(415)%20577-7900> | 12077867   | 1513754378759  |
> | SERVER SORTED BY [L.LOCALID]
>  

Re: Add automatic/default SALT

2017-12-27 Thread James Taylor
There's some information in our Tuning Guide[1] on recommendations of when
to use or not use salted tables. We don't recommend it unless your table
has a monotonically increasing primary key. Understanding why is best
explained with an example. Let's say you have a table with SALT_BUCKETS=20.
When you execute a simple query against that table that might return 10
contiguous rows, you'll be executing 20 scans instead of just one. Each
scan will open a block on the region server - that's 20 block fetches
versus what would otherwise be a single block fetch (assuming that the 10
rows being returned are in the same block since they're contiguous). The
only time you're not hit with this 20x block fetch cost is if you're doing
a point lookup (as the client can precompute the salt byte in that case).

[1] https://phoenix.apache.org/tuning_guide.html

On Wed, Dec 27, 2017 at 3:26 PM, Flavio Pompermaier 
wrote:

> Hi Josh,
> Thanks for the feedback. Do you have any concrete example where salted
> tables are 'evil'? However I really like the idea to enable salting using
> some predefined variable (like number of region servers or something like
> that).
> An example could be:
>
> SALT_BUCKETS = $REGION_SERVERS_COUNT
>
> Best,
> Flavio
>
>
> On 12 Dec 2017 01:45, "Josh Elser"  wrote:
>
> I'm a little hesitant of this for a few things I've noticed from lots of
> various installations:
>
> * Salted tables are *not* always more efficient. In fact, I've found
> myself giving advice to not use salted tables a bit more than expected.
> Certain kinds of queries will require much more work if you have salting
> over not having salting
>
> * Considering salt buckets as a measure of parallelism for a table, it's
> impossible for the system to correctly judge what the parallelism of the
> cluster should be. For example, with 10 RS and 1 Phoenix table, you would
> want to start with 10 salt buckets. However, with 10 RS and 100 Phoenix
> tables, you'd *maybe* want to do 3 salt buckets. It's hard to make system
> wide decisions correctly without a global view of the entire system.
>
> I think James was trying to capture some of this in his use of "relative
> conservative default", but I'd take that even a bit farther to say I
> consider it harmful for Phoenix to do that out of the box.
>
> However, I would flip the question upside down instead: what kind of
> suggestions can Phoenix make as a database to the user to _recommend_ to
> them that they enable salting on a table given its schema and important
> queries?
>
>
> On 12/8/17 12:34 PM, James Taylor wrote:
>
>> Hi Flavio,
>> I like the idea of “adaptable configuration” where you specify a config
>> value as a % of some cluster resource (with relatively conservative
>> defaults). Salting is somewhat of a gray area though as it’s not config
>> based, but driven by your DDL. One solution you could implement on top of
>> Phoenix is scripting for DDL that fills in the salt bucket parameter based
>> on cluster size.
>> Thanks,
>> James
>>
>> On Tue, Dec 5, 2017 at 12:50 AM Flavio Pompermaier > > wrote:
>>
>> Hi to all,
>> as stated by at the documentation[1] "for optimal performance,
>> number of salt buckets should match number of region servers".
>> So, why not to add an option AUTO/DEFAULT for salting that defaults
>> this parameter to the number of region servers?
>> Otherwise I have to manually connect to HBase, retrieve that number
>> and pass to Phoenix...
>> What do you think?
>>
>> [1] https://phoenix.apache.org/performance.html#Salting
>>
>> Best,
>> Flavio
>>
>>
>


Re: Query optimization

2017-12-27 Thread James Taylor
A hash join (the default) will be faster but the tables being cached (last
or RHS table being joined) must be small enough to fit into memory on the
region server. If it's too big, you can use the USE_SORT_MERGE_JOIN which
would not have this restriction.

On Wed, Dec 27, 2017 at 3:16 PM, Flavio Pompermaier 
wrote:

> Just to summarize things...is the best approach, in terms of required
> memory, for Apache Phoenix queries to use sort merge join? Should inner
> queries be avoided?
>
>
> On 22 Dec 2017 22:47, "Flavio Pompermaier"  wrote:
>
> MYTABLE is definitely much bigger than PEOPLE table, in terms of
> cardinality. In terms of cells (rows x columns) PEOPLE is probably bigger
>
> On 22 Dec 2017 22:36, "Ethan"  wrote:
>
>> I see. I think client side probably hold on to the iterators from the
>> both sides and crawling forward to do the merge sort. in this case should
>> be no much memory footprint either way where the filter is performed.
>>
>> On December 22, 2017 at 1:04:18 PM, James Taylor (jamestay...@apache.org)
>> wrote:
>>
>> There’s no shipping of any tables with a sort merge join.
>>
>> On Fri, Dec 22, 2017 at 1:02 PM Ethan Wang  wrote:
>>
>>> I see. Looks like it's possible the rhs (MYTABLE) is too big to ship
>>> around without get filtered first. Just for experiment, if you took out
>>> hint USE_SORT_MERGE_JOIN, what will be the plan?
>>>
>>>
>>> On December 22, 2017 at 12:46:25 PM, James Taylor (
>>> jamestay...@apache.org) wrote:
>>>
>>> For sort merge join, both post-filtered table results are sorted on the
>>> server side and then a merge sort is done on the client-side.
>>>
>>> On Fri, Dec 22, 2017 at 12:44 PM, Ethan  wrote:
>>>
 Hello Flavio,

 From the plan looks like to me the second query is doing the filter at
 parent table (PEOPLE). So what is the size of your PEOPLE and MYTABLE
 (after filtered) respectively?

 For sort merge join, anyone knows are the both sides get shipped to
 client to do the merge sort?

 Thanks,


 On December 22, 2017 at 9:58:30 AM, Flavio Pompermaier (
 pomperma...@okkam.it) wrote:

 Any help here...?

 On 20 Dec 2017 17:58, "Flavio Pompermaier" 
 wrote:

> Hi to all,
> I'm trying to find the best query for my use case but I found that one
> version work and the other one does not (unless that I don't apply some
> tuning to timeouts etc like explained in [1]).
>
> The 2 queries extract the same data but, while the first query
> terminates the second does not.
> *PS*:  without the USE_SORT_MERGE_JOIN both queries weren't working
>
> SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*)
> FROM PEOPLE ds JOIN MYTABLE l ON ds.PERSON_ID = l.LOCALID
> WHERE l.EID IS NULL AND l.DSID = 'PEOPLE' AND l.HAS_CANDIDATES = FALSE;
> +---
> +---
> --+++
> | PLAN
>   | EST_BYTES_READ  |
> EST_ROWS_READ  |  EST_INFO_TS   |
> +---
> +---
> --+++
> | SORT-MERGE-JOIN (INNER) TABLES
>   | 14155777900
> <(415)%20577-7900> | 12077867   | 1513754378759  |
> | CLIENT 42-CHUNK 6168903 ROWS 1132461 BYTES PARALLEL 3-WAY
> FULL SCAN OVER PEOPLE | 14155777900 <(415)%20577-7900>
>| 12077867   | 1513754378759  |
> | SERVER FILTER BY FIRST KEY ONLY
>  | 14155777900
> <(415)%20577-7900> | 12077867   | 1513754378759  |
> | CLIENT MERGE SORT
>  | 14155777900
> <(415)%20577-7900> | 12077867   | 1513754378759  |
> | AND (SKIP MERGE)
>   | 14155777900
> <(415)%20577-7900> | 12077867   | 1513754378759  |
> | CLIENT 15-CHUNK 5908964 ROWS 2831155679 BYTES PARALLEL 15-WAY
> RANGE SCAN OVER MYTABLE [0] - [2]  | 14155777900 <(415)%20577-7900>
>| 12077867   | 1513754378759  |
> | SERVER FILTER BY (EID IS NULL AND DSID = 'PEOPLE' AND
> HAS_CANDIDATES = false)   | 14155777900
> <(415)%20577-7900> | 12077867   | 1513754378759  |
> | SERVER SORTED BY [L.LOCALID]
>   | 14155777900
> <(415)%20577-7900> | 12077867   | 1513754378759  |
> | CLIENT MERGE SORT
>  | 14155777900

Re: Query optimization

2017-12-27 Thread Flavio Pompermaier
Just to summarize things...is the best approach, in terms of required
memory, for Apache Phoenix queries to use sort merge join? Should inner
queries be avoided?

On 22 Dec 2017 22:47, "Flavio Pompermaier"  wrote:

MYTABLE is definitely much bigger than PEOPLE table, in terms of
cardinality. In terms of cells (rows x columns) PEOPLE is probably bigger

On 22 Dec 2017 22:36, "Ethan"  wrote:

> I see. I think client side probably hold on to the iterators from the both
> sides and crawling forward to do the merge sort. in this case should be no
> much memory footprint either way where the filter is performed.
>
> On December 22, 2017 at 1:04:18 PM, James Taylor (jamestay...@apache.org)
> wrote:
>
> There’s no shipping of any tables with a sort merge join.
>
> On Fri, Dec 22, 2017 at 1:02 PM Ethan Wang  wrote:
>
>> I see. Looks like it's possible the rhs (MYTABLE) is too big to ship
>> around without get filtered first. Just for experiment, if you took out
>> hint USE_SORT_MERGE_JOIN, what will be the plan?
>>
>>
>> On December 22, 2017 at 12:46:25 PM, James Taylor (jamestay...@apache.org)
>> wrote:
>>
>> For sort merge join, both post-filtered table results are sorted on the
>> server side and then a merge sort is done on the client-side.
>>
>> On Fri, Dec 22, 2017 at 12:44 PM, Ethan  wrote:
>>
>>> Hello Flavio,
>>>
>>> From the plan looks like to me the second query is doing the filter at
>>> parent table (PEOPLE). So what is the size of your PEOPLE and MYTABLE
>>> (after filtered) respectively?
>>>
>>> For sort merge join, anyone knows are the both sides get shipped to
>>> client to do the merge sort?
>>>
>>> Thanks,
>>>
>>>
>>> On December 22, 2017 at 9:58:30 AM, Flavio Pompermaier (
>>> pomperma...@okkam.it) wrote:
>>>
>>> Any help here...?
>>>
>>> On 20 Dec 2017 17:58, "Flavio Pompermaier"  wrote:
>>>
 Hi to all,
 I'm trying to find the best query for my use case but I found that one
 version work and the other one does not (unless that I don't apply some
 tuning to timeouts etc like explained in [1]).

 The 2 queries extract the same data but, while the first query
 terminates the second does not.
 *PS*:  without the USE_SORT_MERGE_JOIN both queries weren't working

 SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*)
 FROM PEOPLE ds JOIN MYTABLE l ON ds.PERSON_ID = l.LOCALID
 WHERE l.EID IS NULL AND l.DSID = 'PEOPLE' AND l.HAS_CANDIDATES = FALSE;
 +---
 +---
 --+++
 | PLAN
 | EST_BYTES_READ  | EST_ROWS_READ
 |  EST_INFO_TS   |
 +---
 +---
 --+++
 | SORT-MERGE-JOIN (INNER) TABLES
 | 14155777900
 <(415)%20577-7900> | 12077867   | 1513754378759  |
 | CLIENT 42-CHUNK 6168903 ROWS 1132461 BYTES PARALLEL 3-WAY
 FULL SCAN OVER PEOPLE | 14155777900 <(415)%20577-7900>
| 12077867   | 1513754378759  |
 | SERVER FILTER BY FIRST KEY ONLY
  | 14155777900
 <(415)%20577-7900> | 12077867   | 1513754378759  |
 | CLIENT MERGE SORT
  | 14155777900
 <(415)%20577-7900> | 12077867   | 1513754378759  |
 | AND (SKIP MERGE)
 | 14155777900
 <(415)%20577-7900> | 12077867   | 1513754378759  |
 | CLIENT 15-CHUNK 5908964 ROWS 2831155679 BYTES PARALLEL 15-WAY
 RANGE SCAN OVER MYTABLE [0] - [2]  | 14155777900 <(415)%20577-7900>
  | 12077867   | 1513754378759  |
 | SERVER FILTER BY (EID IS NULL AND DSID = 'PEOPLE' AND
 HAS_CANDIDATES = false)   | 14155777900
 <(415)%20577-7900> | 12077867   | 1513754378759  |
 | SERVER SORTED BY [L.LOCALID]
 | 14155777900
 <(415)%20577-7900> | 12077867   | 1513754378759  |
 | CLIENT MERGE SORT
  | 14155777900
 <(415)%20577-7900> | 12077867   | 1513754378759  |
 | CLIENT AGGREGATE INTO SINGLE ROW
 | 14155777900
 <(415)%20577-7900> | 12077867   | 1513754378759  |
 +---
 +---
 --+++


 SELECT /*+ USE_SORT_MERGE_JOIN */