Re: Speeding Up Group By Queries

James Taylor Fri, 25 Mar 2016 15:52:04 -0700

Hi Amit,
Using 4.7.0-HBase-1.1 release, I see the index being used for that query
(see below). An index will help some, as the aggregation can be done in
place as the scan over the index is occurring (as opposed to having to hold
the distinct values found during grouping in memory per chunk of work and
sorting each chunk on the client). It's not going to prevent the entire
index from being scanned though. You'll need a WHERE clause to prevent that.


0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer primary
key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY varchar);
No rows affected (1.32 seconds)
0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON
TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
No rows affected (6.452 seconds)
0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD),
SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
+--------------------------------------------------------------------------+
|                                   PLAN                                   |
+--------------------------------------------------------------------------+
| CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX  |
|     SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"]         |
| CLIENT MERGE SORT                                                        |
+--------------------------------------------------------------------------+
3 rows selected (0.028 seconds)

Thanks,
James


On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <mujt...@apache.org> wrote:

> That seems excessively slow for 10M rows which should be in order of few
> seconds at most without index. 1. How wide is your table 2. How many region
> servers is your data distributed on and what's the heap size? 3. Do you see
> lots of disk I/O on region servers during aggregation? 4. Can you try your
> query after major compacting your table?
>
> Can you also replace log4j.properties with the attached one and reply back
> with phoenix.log created by executing your query in sqlline?
>
> Thanks,
> Mujtaba
>
>
> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <amits...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to evaluate apache hbase (version 1.0.0) and phoenix
>> (version 4.6) deployed through cloudera for our OLAP workfload. I have a
>> table that has 10 mil rows. I try to execute the below roll up query and it
>> takes around 2 mins to return 1,850 rows.
>>
>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY
>> T_COUNTRY;
>>
>> I tried applying the "joining with indices" example given on the website
>> <https://phoenix.apache.org/joins.html> on the TRANSACTIONS table by
>> creating an index on the grouped by column as below but that doesn't help.
>>
>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY)
>> INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>
>> This index is not getting used when the query is executed. The query plan
>> is as below
>>
>> +------------------------------------------+
>> |                   PLAN                   |
>> +------------------------------------------+
>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
>> |     SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY] |
>> | CLIENT MERGE SORT                        |
>> +------------------------------------------+
>>
>> Theoretically can secondary indexes help improve the performance of group
>> by queries?
>>
>> Any suggestions on what are different options in phoenix I could try out
>> to speed up GROUP BY queries?
>>
>> Thanks,
>> Amit.
>>
>
>

Re: Speeding Up Group By Queries

Reply via email to