Since your PK already leads with (sid, day), I don't think adding a secondary index will help.

Do you generally run the aggregation query only over recently inserted data? If so, the ROW_TIMESTAMP feature might help here: https://phoenix.apache.org/rowtimestamp.html

If you run the same aggregate queries multiple times, another approach is to store the coarser, daily-aggregated data in a separate table that you populate with an UPSERT SELECT. I've put rough sketches of both ideas below.

I'm also not sure why the explain plan you attached has a CLIENT MERGE SORT, since you don't have an ORDER BY.
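For the ROW_TIMESTAMP idea, a minimal sketch of what the declaration might look like adapted to your schema. The PUBLISHER_V5 name is just illustrative, the designation can only be made at table creation (so it would mean rewriting the data), and the doc page above lists the exact restrictions on which PK columns qualify, so please check it before relying on this shape:

-- Hypothetical rebuild of the table with the day PK column mapped to the
-- HBase cell timestamp, so scans over recent dates can skip older store files.
CREATE TABLE IF NOT EXISTS PUBLISHER_V5 (
    sid BIGINT NOT NULL,
    day DATE NOT NULL,
    cid BIGINT NOT NULL,
    s.count INTEGER,
    CONSTRAINT PK PRIMARY KEY (sid, day ROW_TIMESTAMP, cid)
) COMPRESSION='SNAPPY'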
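For the rollup idea, something along these lines (the T1_DAILY name and the date range are just placeholders for whatever window you re-aggregate, and I'm using TRUNC to bucket the hourly timestamps to days):

-- Hypothetical daily rollup table with the same PK shape as the hourly table
CREATE TABLE IF NOT EXISTS T1_DAILY (
    sid BIGINT NOT NULL,
    day DATE NOT NULL,
    cid BIGINT NOT NULL,
    s.count BIGINT,
    CONSTRAINT PK PRIMARY KEY (sid, day, cid)
) COMPRESSION='SNAPPY'

-- Periodically roll one day of hourly rows up into it
UPSERT INTO T1_DAILY (sid, day, cid, s.count)
SELECT sid, TRUNC(day, 'DAY'), cid, CAST(SUM(count) AS BIGINT)
FROM T1
WHERE day >= TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
  AND day < TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
GROUP BY sid, TRUNC(day, 'DAY'), cid

Your existing aggregation query would then run against the rollup table and touch roughly 1/24th of the rows.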
On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <mgand...@gmail.com> wrote:

> Here it is
> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
> [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
> SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
> CLIENT MERGE SORT
>
> Note: we have a dedicated Phoenix Query Server
>
> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gjac...@salesforce.com>
> wrote:
>
>> Monil,
>>
>> Could you please post the results of an EXPLAIN plan of your query? For
>> directions on how to do this, please see
>> http://phoenix.apache.org/explainplan.html
>>
>> Geoffrey Jacoby
>>
>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mgand...@gmail.com> wrote:
>>
>>> Hello,
>>> Currently we have hourly data in our Phoenix table. However, the schema
>>> was designed to perform well for daily data. Increasing the number of rows
>>> by 24x has led to degradation of our service over time.
>>> Our current schema is as follows:
>>>
>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
>>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid, day,
>>> cid) ) COMPRESSION='SNAPPY'
>>>
>>> The query we run is something along the lines of
>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
>>> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd',
>>> 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid
>>>
>>> Based on our investigation we have concluded that the main reason is
>>> purely the number of rows being read. I am open to other suggestions.
>>>
>>> If the number of rows is the issue, I am wondering if there is a way to either
>>> 1. roll hourly data up to daily using views, a secondary index, or map
>>> reduce. I know map reduce is possible.
>>> 2. migrate to a newer schema where cid is not part of the PK and is actually
>>> a column family. I was unable to find any kind of documentation on this.
>>>
>>> Thanks
>>> Monil