Re: Slow query help

Flavio Pompermaier Mon, 19 Mar 2018 10:03:42 -0700

Any insight here..?

On Fri, Mar 16, 2018 at 7:23 PM, Flavio Pompermaier <pomperma...@okkam.it>
wrote:


> Thanks everybody for the help.
> I'm just curios to understand why the first query didn't complete. The
> query is quite complex but the available memory should be more than enough.
>
> MYTABLE has 222,547,674 rows. On Parquet it takes 15 GB, while
> uncompressed (in memory during the download of the data) about 600 GB.
> There are 144,149,639 rows where SOMEFIELD is not null and it has 37,449,596
> distinct values (the table field is an unlimited VARCHAR but all values, if
> any, has a max length of 20 chars).
> HBase has 3 region servers with 64 GB each.
> I've tried to give up to 32 GB of memory but the "sloppy" query wasn't
> able to finish.
> I thought that the count of distinct values on a single field shouldn't be
> that heavy in the end (despite the source table is very big).
> Am I wrong?
>
> Any help is welcome.
> Flavio
>
> On Fri, Mar 16, 2018 at 7:08 PM, Samarth Jain <samarth.j...@gmail.com>
> wrote:
>
>> A less resource intensive approach would be to use approx count distinct
>> - https://phoenix.apache.org/language/functions.html#approx_
>> count_distinct
>>
>> You would still need the secondary index though, as James suggested, if
>> you want it to run fast.
>>
>> On Fri, Mar 16, 2018 at 10:26 AM Flavio Pompermaier <pomperma...@okkam.it>
>> wrote:
>>
>>> Thanks for the tip James. I didn't know that syntax for doing the count
>>> on a distinct value!
>>> This version is able to end, the first one wasn't able to finish even
>>> giving a huge amount of memory to HBase (the cardinality of SOMEFIELD is
>>> very big indeed).
>>>
>>> Thanks a lot,
>>> Flavio
>>>
>>> On Fri, Mar 16, 2018 at 5:45 PM, James Taylor <jamestay...@apache.org>
>>> wrote:
>>>
>>>> Hi Flavio,
>>>> You'll need to add a secondary index to SOMEFIELD (or SOMEFIELD +
>>>> VALID) to speed that up. You can write it more simply as SELECT
>>>> COUNT(DISTINCT SOMEFIELD) FROM TEST.MYTABLE WHERE VALID AND SOMEFIELD IS
>>>> NOT NULL.
>>>>
>>>> Otherwise, you'll end up doing a full table scan (and use a fair amount
>>>> of memory depending on the cardinality of SOMEFIELD). The above with a
>>>> secondary index would skip to the distinct values instead.
>>>>
>>>> Thanks,
>>>> James
>>>>
>>>> On Fri, Mar 16, 2018 at 8:30 AM, Flavio Pompermaier <
>>>> pomperma...@okkam.it> wrote:
>>>>
>>>>> Hi to all,
>>>>> I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2):
>>>>>
>>>>> SELECT COUNT(*) FROM (
>>>>>   SELECT DISTINCT(SOMEFIELD)
>>>>>   FROM TEST.MYTABLE
>>>>>   WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL
>>>>> )
>>>>>
>>>>> Unfortunately the query timeouts (timeout is 10 min).... Any
>>>>> suggestion about how to tune my installation?
>>>>>
>>>>> Best,
>>>>> Flavi
>>>>>
>>>>
>>>
>

Re: Slow query help

Reply via email to