Re: Slow query help

2018-03-19 Thread Flavio Pompermaier
Any insight here..?

On Fri, Mar 16, 2018 at 7:23 PM, Flavio Pompermaier 
wrote:

> Thanks everybody for the help.
> I'm just curios to understand why the first query didn't complete. The
> query is quite complex but the available memory should be more than enough.
>
> MYTABLE has 222,547,674 rows. On Parquet it takes 15 GB, while
> uncompressed (in memory during the download of the data) about 600 GB.
> There are 144,149,639 rows where SOMEFIELD is not null and it has 37,449,596
> distinct values (the table field is an unlimited VARCHAR but all values, if
> any, has a max length of 20 chars).
> HBase has 3 region servers with 64 GB each.
> I've tried to give up to 32 GB of memory but the "sloppy" query wasn't
> able to finish.
> I thought that the count of distinct values on a single field shouldn't be
> that heavy in the end (despite the source table is very big).
> Am I wrong?
>
> Any help is welcome.
> Flavio
>
> On Fri, Mar 16, 2018 at 7:08 PM, Samarth Jain 
> wrote:
>
>> A less resource intensive approach would be to use approx count distinct
>> - https://phoenix.apache.org/language/functions.html#approx_
>> count_distinct
>>
>> You would still need the secondary index though, as James suggested, if
>> you want it to run fast.
>>
>> On Fri, Mar 16, 2018 at 10:26 AM Flavio Pompermaier 
>> wrote:
>>
>>> Thanks for the tip James. I didn't know that syntax for doing the count
>>> on a distinct value!
>>> This version is able to end, the first one wasn't able to finish even
>>> giving a huge amount of memory to HBase (the cardinality of SOMEFIELD is
>>> very big indeed).
>>>
>>> Thanks a lot,
>>> Flavio
>>>
>>> On Fri, Mar 16, 2018 at 5:45 PM, James Taylor 
>>> wrote:
>>>
 Hi Flavio,
 You'll need to add a secondary index to SOMEFIELD (or SOMEFIELD +
 VALID) to speed that up. You can write it more simply as SELECT
 COUNT(DISTINCT SOMEFIELD) FROM TEST.MYTABLE WHERE VALID AND SOMEFIELD IS
 NOT NULL.

 Otherwise, you'll end up doing a full table scan (and use a fair amount
 of memory depending on the cardinality of SOMEFIELD). The above with a
 secondary index would skip to the distinct values instead.

 Thanks,
 James

 On Fri, Mar 16, 2018 at 8:30 AM, Flavio Pompermaier <
 pomperma...@okkam.it> wrote:

> Hi to all,
> I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2):
>
> SELECT COUNT(*) FROM (
>   SELECT DISTINCT(SOMEFIELD)
>   FROM TEST.MYTABLE
>   WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL
> )
>
> Unfortunately the query timeouts (timeout is 10 min) Any
> suggestion about how to tune my installation?
>
> Best,
> Flavi
>

>>>
>


Re: Slow query help

2018-03-16 Thread Flavio Pompermaier
Thanks everybody for the help.
I'm just curios to understand why the first query didn't complete. The
query is quite complex but the available memory should be more than enough.

MYTABLE has 222,547,674 rows. On Parquet it takes 15 GB, while uncompressed
(in memory during the download of the data) about 600 GB.
There are 144,149,639 rows where SOMEFIELD is not null and it has 37,449,596
distinct values (the table field is an unlimited VARCHAR but all values, if
any, has a max length of 20 chars).
HBase has 3 region servers with 64 GB each.
I've tried to give up to 32 GB of memory but the "sloppy" query wasn't able
to finish.
I thought that the count of distinct values on a single field shouldn't be
that heavy in the end (despite the source table is very big).
Am I wrong?

Any help is welcome.
Flavio

On Fri, Mar 16, 2018 at 7:08 PM, Samarth Jain 
wrote:

> A less resource intensive approach would be to use approx count distinct -
> https://phoenix.apache.org/language/functions.html#approx_count_distinct
>
> You would still need the secondary index though, as James suggested, if
> you want it to run fast.
>
> On Fri, Mar 16, 2018 at 10:26 AM Flavio Pompermaier 
> wrote:
>
>> Thanks for the tip James. I didn't know that syntax for doing the count
>> on a distinct value!
>> This version is able to end, the first one wasn't able to finish even
>> giving a huge amount of memory to HBase (the cardinality of SOMEFIELD is
>> very big indeed).
>>
>> Thanks a lot,
>> Flavio
>>
>> On Fri, Mar 16, 2018 at 5:45 PM, James Taylor 
>> wrote:
>>
>>> Hi Flavio,
>>> You'll need to add a secondary index to SOMEFIELD (or SOMEFIELD + VALID)
>>> to speed that up. You can write it more simply as SELECT COUNT(DISTINCT
>>> SOMEFIELD) FROM TEST.MYTABLE WHERE VALID AND SOMEFIELD IS NOT NULL.
>>>
>>> Otherwise, you'll end up doing a full table scan (and use a fair amount
>>> of memory depending on the cardinality of SOMEFIELD). The above with a
>>> secondary index would skip to the distinct values instead.
>>>
>>> Thanks,
>>> James
>>>
>>> On Fri, Mar 16, 2018 at 8:30 AM, Flavio Pompermaier <
>>> pomperma...@okkam.it> wrote:
>>>
 Hi to all,
 I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2):

 SELECT COUNT(*) FROM (
   SELECT DISTINCT(SOMEFIELD)
   FROM TEST.MYTABLE
   WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL
 )

 Unfortunately the query timeouts (timeout is 10 min) Any suggestion
 about how to tune my installation?

 Best,
 Flavi

>>>
>>


Re: Slow query help

2018-03-16 Thread Samarth Jain
A less resource intensive approach would be to use approx count distinct -
https://phoenix.apache.org/language/functions.html#approx_count_distinct

You would still need the secondary index though, as James suggested, if you
want it to run fast.

On Fri, Mar 16, 2018 at 10:26 AM Flavio Pompermaier 
wrote:

> Thanks for the tip James. I didn't know that syntax for doing the count on
> a distinct value!
> This version is able to end, the first one wasn't able to finish even
> giving a huge amount of memory to HBase (the cardinality of SOMEFIELD is
> very big indeed).
>
> Thanks a lot,
> Flavio
>
> On Fri, Mar 16, 2018 at 5:45 PM, James Taylor 
> wrote:
>
>> Hi Flavio,
>> You'll need to add a secondary index to SOMEFIELD (or SOMEFIELD + VALID)
>> to speed that up. You can write it more simply as SELECT COUNT(DISTINCT
>> SOMEFIELD) FROM TEST.MYTABLE WHERE VALID AND SOMEFIELD IS NOT NULL.
>>
>> Otherwise, you'll end up doing a full table scan (and use a fair amount
>> of memory depending on the cardinality of SOMEFIELD). The above with a
>> secondary index would skip to the distinct values instead.
>>
>> Thanks,
>> James
>>
>> On Fri, Mar 16, 2018 at 8:30 AM, Flavio Pompermaier > > wrote:
>>
>>> Hi to all,
>>> I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2):
>>>
>>> SELECT COUNT(*) FROM (
>>>   SELECT DISTINCT(SOMEFIELD)
>>>   FROM TEST.MYTABLE
>>>   WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL
>>> )
>>>
>>> Unfortunately the query timeouts (timeout is 10 min) Any suggestion
>>> about how to tune my installation?
>>>
>>> Best,
>>> Flavi
>>>
>>
>


Re: Slow query help

2018-03-16 Thread Flavio Pompermaier
Thanks for the tip James. I didn't know that syntax for doing the count on
a distinct value!
This version is able to end, the first one wasn't able to finish even
giving a huge amount of memory to HBase (the cardinality of SOMEFIELD is
very big indeed).

Thanks a lot,
Flavio

On Fri, Mar 16, 2018 at 5:45 PM, James Taylor 
wrote:

> Hi Flavio,
> You'll need to add a secondary index to SOMEFIELD (or SOMEFIELD + VALID)
> to speed that up. You can write it more simply as SELECT COUNT(DISTINCT
> SOMEFIELD) FROM TEST.MYTABLE WHERE VALID AND SOMEFIELD IS NOT NULL.
>
> Otherwise, you'll end up doing a full table scan (and use a fair amount of
> memory depending on the cardinality of SOMEFIELD). The above with a
> secondary index would skip to the distinct values instead.
>
> Thanks,
> James
>
> On Fri, Mar 16, 2018 at 8:30 AM, Flavio Pompermaier 
> wrote:
>
>> Hi to all,
>> I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2):
>>
>> SELECT COUNT(*) FROM (
>>   SELECT DISTINCT(SOMEFIELD)
>>   FROM TEST.MYTABLE
>>   WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL
>> )
>>
>> Unfortunately the query timeouts (timeout is 10 min) Any suggestion
>> about how to tune my installation?
>>
>> Best,
>> Flavi
>>
>


Re: Slow query help

2018-03-16 Thread James Taylor
Hi Flavio,
You'll need to add a secondary index to SOMEFIELD (or SOMEFIELD + VALID) to
speed that up. You can write it more simply as SELECT COUNT(DISTINCT
SOMEFIELD) FROM TEST.MYTABLE WHERE VALID AND SOMEFIELD IS NOT NULL.

Otherwise, you'll end up doing a full table scan (and use a fair amount of
memory depending on the cardinality of SOMEFIELD). The above with a
secondary index would skip to the distinct values instead.

Thanks,
James

On Fri, Mar 16, 2018 at 8:30 AM, Flavio Pompermaier 
wrote:

> Hi to all,
> I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2):
>
> SELECT COUNT(*) FROM (
>   SELECT DISTINCT(SOMEFIELD)
>   FROM TEST.MYTABLE
>   WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL
> )
>
> Unfortunately the query timeouts (timeout is 10 min) Any suggestion
> about how to tune my installation?
>
> Best,
> Flavi
>


Slow query help

2018-03-16 Thread Flavio Pompermaier
Hi to all,
I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2):

SELECT COUNT(*) FROM (
  SELECT DISTINCT(SOMEFIELD)
  FROM TEST.MYTABLE
  WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL
)

Unfortunately the query timeouts (timeout is 10 min) Any suggestion
about how to tune my installation?

Best,
Flavi