Re: Speeding Up Group By Queries

2017-05-25 Thread cmbendre
Hi,

I am observing similar behavior. I am doing a POC of Phoenix, since our
query workloads are a mix of point lookups and aggregations.

From what I can see, Phoenix performs well on point lookups based on either
the PK or a secondary index. But when it comes to aggregations, it has to
do a full scan, and that is slow.

Our production setup will have 500 million - 1 billion rows, and it will
grow over time. I tried all the tuning suggestions in this thread, but
group-by queries are still slow. (Although this is a small cluster with 2
region servers with 6 GB RAM and 4 cores each.)

Is increasing the number of machines the only choice we have if we want to
support sub-second group-by queries?

Thanks
Chaitanya





Re: Speeding Up Group By Queries

2016-04-12 Thread James Taylor
Will the 10-100 million records get larger? Or is that just for a single
user and you plan to have many users? If not, have you considered using a
standard RDBMS like MySQL, Postgres, or MariaDB?

Thanks,
James

On Tuesday, April 12, 2016, Amit Shah wrote:

> [...]

Re: Speeding Up Group By Queries

2016-04-12 Thread Amit Shah
Thanks James for the reply. Please see my comments below

> Secondary indexes[1] on the non-primary key columns are the way to improve
> performance for these cases. Take a look at this[2] presentation for more
> detail.


I have done a brief reading on secondary indexes and will go through the
video for a detailed study. I understand that they can help improve
performance, but with the OLAP workload that we plan to use HBase + Phoenix
for, secondary indexes will probably not be a good solution. With OLAP, we
do not have control over the queries that are executed during interactive
analysis. Let me know if you have thoughts on this.

> I'd recommend at least 6 nodes, and 10-20 nodes would be better in your
> test environment.


We are in a proof-of-concept phase and have not yet finalized how big the
HBase cluster will be in production. Initially we plan to start with 4-5
nodes. Our data size would be 10-100 million records (not in the billions,
for sure). Do you see a reason we should increase the cluster size?


> Have you seen our Pherf tool[3] that will help you benchmark your queries
> under representative data sizes?


I will look at the tool.

Thanks again for sharing your inputs.

On Mon, Apr 11, 2016 at 9:29 PM, James Taylor wrote:

> [...]

Re: Speeding Up Group By Queries

2016-04-11 Thread James Taylor
Hi Amit,

If a query doesn't filter on the primary key columns, the entire table must
be scanned (hence it'll be slower). Secondary indexes[1] on the non-primary
key columns are the way to improve performance for these cases. Take a look
at this[2] presentation for more detail.

Also, a 3-node cluster is not ideal for benchmarking, but it might be fine
for just playing around a bit. There's always overhead with a distributed
system that you won't see in a more typical single-node RDBMS. If you can
mimic the cluster size you'll use in production, that'd be ideal.
Otherwise, I'd recommend at least 6 nodes, and 10-20 nodes would be better
in your test environment. Have you seen our Pherf tool[3] that will help
you benchmark your queries under representative data sizes?

Thanks,
James

[1] https://phoenix.apache.org/secondary_indexing.html
[2] https://www.youtube.com/watch?v=f4Nmh5KM6gI
[3] https://phoenix.apache.org/pherf.html

On Mon, Apr 11, 2016 at 6:37 AM, Amit Shah wrote:

> [...]

Re: Speeding Up Group By Queries

2016-04-11 Thread Amit Shah
Hi Mujtaba,

I observed that if the where-clause and group-by queries are applied on the
primary key columns, then they are superfast (~200 ms). This is not the
case with queries that have non-primary-key columns in the where clause and
group by. I tried configuring the bucket cache, but surprisingly it doesn't
give the speed improvement I had expected. One of these queries takes 26
secs and 31 secs on a 3-node HBase cluster with and without the bucket
cache, respectively.

Let me know if you have suggestions that we could try out.

Regards,
Amit.

On Tue, Mar 29, 2016 at 10:55 PM, Amit Shah wrote:

> [...]

Re: Speeding Up Group By Queries

2016-03-29 Thread Amit Shah
Hi Mujtaba,

Could these improvements be because of region distribution across region
servers? Along with the optimizations you had suggested, I had also used
hbase-region-inspector to move regions evenly across the region servers.

Below is the table schema for the TRANSACTIONS table:


CREATE TABLE TRANSACTIONS_TEST (
AGENT_ID BIGINT
,A_NAME VARCHAR
,A_ROLE VARCHAR
,TERRITORY_ID BIGINT
,T_ZIPCODE BIGINT
,T_PLACE_NAME VARCHAR
,GRPBYCF.T_STATE VARCHAR
,GRPBYCF.T_COUNTRY VARCHAR
,PRODUCT_ID BIGINT NOT NULL
,P_NAME VARCHAR
,P_CATEGORY VARCHAR
,CHANNEL_ID BIGINT
,CH_NAME VARCHAR
,CH_TYPE VARCHAR
,CUSTOMER_ID BIGINT NOT NULL
,CS_NAME VARCHAR
,CS_TYPE VARCHAR
,IS_NEW_CUSTOMER BOOLEAN
,CLOSE_DATE DATE
,DAY_CNT_SPENT INTEGER
,TOTAL_EXPENSES BIGINT
,FORCASTED_SALES BIGINT
,GRPBYCF.UNIT_CNT_SOLD BIGINT
,PRICE_PER_UNIT BIGINT
,DISCOUNT_PERCENT BIGINT
,GRPBYCF.TOTAL_SALES BIGINT
,CONSTRAINT pk PRIMARY KEY (PRODUCT_ID, CUSTOMER_ID)
) COMPRESSION='SNAPPY';
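
With the group-by columns isolated in the GRPBYCF family above, an
aggregate over them mainly reads that one column family's store files,
which is the point of Mujtaba's earlier suggestion. A minimal sketch of the
kind of query this layout serves (column names as in the schema above):

-- all three referenced columns live in GRPBYCF, so the scan is
-- largely confined to that family's store files
SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES)
FROM TRANSACTIONS_TEST
GROUP BY T_COUNTRY;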

I will try out the guidepost width reduction and let you know the results.

Thank you,
Amit.

On Tue, Mar 29, 2016 at 10:50 PM, Mujtaba Chohan wrote:

> [...]

Re: Speeding Up Group By Queries

2016-03-29 Thread Mujtaba Chohan
Optimization did help somewhat but not to the extent I was expecting. See
chart below.

[image: Inline image 1]

Can you share your table schema so I can experiment with it? Another thing
you can try is reducing the guidepost width for this table by executing:

UPDATE STATISTICS TRANSACTIONS SET "phoenix.stats.guidepost.width"=5000;




On Tue, Mar 29, 2016 at 6:45 AM, Amit Shah wrote:

> [...]

Re: Speeding Up Group By Queries

2016-03-29 Thread Amit Shah
Hi Mujtaba,

I did try the two optimization techniques by recreating the table and then
loading it again with 10 mil records. They do not seem to help much in
terms of the timings. Kindly find the Phoenix log file attached. Let me
know if I am missing anything.

Thanks,
Amit.

On Mon, Mar 28, 2016 at 11:44 PM, Mujtaba Chohan wrote:

> [...]

Re: Speeding Up Group By Queries

2016-03-28 Thread Mujtaba Chohan
Here's the chart of the time it takes for each of the parallel scans after
the split. On the RS where data is not read from disk, the scan gets back
in ~20 secs, but for the RS which has 6 regions it's ~45 secs.

[image: Inline image 2]

> Yes I see disk reads with 607 ios/second on the host that stores 6
> regions

Two things that you should try to reduce disk reads, or maybe a combination
of both:

1. Have only the columns used in your group-by query in a separate column
   family: CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD
   integer, GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...)
2. Turn on Snappy compression for your table with ALTER TABLE T SET
   COMPRESSION='SNAPPY', followed by a major compaction.

> I tried to compact the table from the HBase web UI

You need to run *major_compact* from the HBase shell; from the UI it's a
minor compaction.
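
Combined, the two suggestions would look roughly like this for the
TRANSACTIONS table discussed in this thread (a sketch, not verified here):

-- in sqlline: enable compression on the existing table
ALTER TABLE TRANSACTIONS SET COMPRESSION='SNAPPY';

-- then from the HBase shell: rewrite the store files with compression
major_compact 'TRANSACTIONS'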

- mujtaba

On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah  wrote:

> Thanks Mujtaba and James for replying back.
>
> Mujtaba, below are details for your follow-up queries.
>
>> 1. How wide is your table
>
> I have 26 columns in the TRANSACTIONS table, with a couple of columns
> combined to form the primary key.
>
>> 2. How many region servers is your data distributed on and what's the
>> heap size?
>
> When I posted the initial readings of the query taking around 2 minutes,
> I had one region server storing 4 regions for the 10 mil records
> TRANSACTIONS table. The heap size on the master server is 1 GB, while the
> region server has a 3.63 GB heap setting.
>
> Later I added 2 more region servers to the cluster and configured them as
> data nodes and region servers. After this step, the regions got split
> across two region servers, with 2 regions on one region server and 6 on
> the other. I didn't follow what action caused this region split, or
> whether it was done automatically by HBase (the load balancer?).
>
>> 3. Do you see lots of disk I/O on region servers during aggregation?
>
> Yes, I see disk reads at 607 IOs/second on the host that stores 6
> regions. Kindly find the disk I/O statistics attached as images.
>
>> 4. Can you try your query after major compacting your table?
>
> I tried to compact the table from the HBase web UI. For some reason, the
> compaction table attribute on the web UI is still shown as NONE. After
> these changes, the query time is down to *42 secs*.
> Is compression different from compaction? Would the query performance
> improve by compressing the data with one of the algorithms? Logically it
> doesn't sound right, though.
>
>> Can you also replace log4j.properties with the attached one and reply
>> back with phoenix.log created by executing your query in sqlline?
>
> After replacing the log4j.properties, I have captured the logs for the
> group-by query execution and attached them.
>
> James,
> If I follow the queries that you pasted, I see the index getting used,
> but if I try to explain the query plan on the pre-loaded TRANSACTIONS
> table I do not see the index being used. Probably the query plan changes
> based on whether the table has data or not.
>
> The query time is reduced to 42 secs right now. Let me know if you have
> more suggestions on how to improve it further.
>
> Thanks,
> Amit.
>
> On Sat, Mar 26, 2016 at 4:21 AM, James Taylor wrote:
>
>> [...]

Re: Speeding Up Group By Queries

2016-03-25 Thread James Taylor
Hi Amit,
Using the 4.7.0-HBase-1.1 release, I see the index being used for that
query (see below). An index will help some, as the aggregation can be done
in place as the scan over the index occurs (as opposed to having to hold
the distinct values found during grouping in memory per chunk of work and
sort each chunk on the client). It's not going to prevent the entire index
from being scanned, though. You'll need a WHERE clause to prevent that.

0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer primary
key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY varchar);
No rows affected (1.32 seconds)
0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON
TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
No rows affected (6.452 seconds)
0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD),
SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
+--+
|   PLAN   |
+--+
| CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX  |
| SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"] |
| CLIENT MERGE SORT|
+--+
3 rows selected (0.028 seconds)
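
To make that last point concrete: a filter on the leading indexed column
turns the full index scan into a range scan. A sketch (the country value
here is just a placeholder):

-- only index rows matching the T_COUNTRY prefix are scanned
SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES)
FROM TRANSACTIONS
WHERE T_COUNTRY = 'US'
GROUP BY T_COUNTRY;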

Thanks,
James


On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan wrote:

> [...]


Re: Speeding Up Group By Queries

2016-03-25 Thread Mujtaba Chohan
That seems excessively slow for 10M rows, which should take a few seconds
at most even without an index.

1. How wide is your table?
2. How many region servers is your data distributed on, and what's the
   heap size?
3. Do you see lots of disk I/O on region servers during aggregation?
4. Can you try your query after major compacting your table?

Can you also replace log4j.properties with the attached one and reply back
with phoenix.log created by executing your query in sqlline?

Thanks,
Mujtaba


On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah wrote:

> [...]


log4j.properties
Description: Binary data


Speeding Up Group By Queries

2016-03-25 Thread Amit Shah
Hi,

I am trying to evaluate Apache HBase (version 1.0.0) and Phoenix (version
4.6), deployed through Cloudera, for our OLAP workload. I have a table that
has 10 mil rows. When I execute the below roll-up query, it takes around
2 mins to return 1,850 rows.

SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY
T_COUNTRY;

I tried applying the "joining with indices" example given on the website
to the TRANSACTIONS table by creating an index on the grouped-by column,
as below, but that doesn't help.

CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY) INCLUDE
(UNIT_CNT_SOLD, TOTAL_SALES);

This index is not getting used when the query is executed. The query plan
is as below:

+--+
|   PLAN   |
+--+
| CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
| SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY] |
| CLIENT MERGE SORT|
+--+
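
One way to rule out planner quirks when an index isn't picked up is
Phoenix's index hint (hint syntax as documented for Phoenix, not taken from
this thread; whether it helps here is untested):

-- ask the planner to use TRANSACTIONS_COUNTRY_INDEX for this query
SELECT /*+ INDEX(TRANSACTIONS TRANSACTIONS_COUNTRY_INDEX) */
       SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES)
FROM TRANSACTIONS
GROUP BY T_COUNTRY;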

Theoretically, can secondary indexes help improve the performance of
group-by queries?

Any suggestions on different options in Phoenix that I could try out to
speed up GROUP BY queries?

Thanks,
Amit.