Thanks for getting back. The query we run is always an aggregation query, and the date range can be anywhere from the current date to two years back. I was looking into UPSERT SELECT, but there does not seem to be an easy way to run it. For instance, I thought running something like the below would be optimal:
UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach)
VALUES (SELECT segment_id, cross_segment_id, TO_DATE('2017-07-01', 'yyyy-MM-dd', 'GMT') as day, CAST(SUM(reach) AS BIGINT) as reach
FROM PUBLISHER_V4
WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
GROUP BY segment_id, cross_segment_id)

Hence I tried running just the SELECT statement to see if it works, and I started to see the error below:

Error: Task org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask@a66e580 rejected from org.apache.phoenix.job.JobManager$1@20de05e5[Running, pool size = 128, active threads = 128, queued tasks = 5000, completed tasks = 5153] (state=08000,code=101)

When I changed the SELECT query to include a particular sid, the UPSERT SELECT worked. Hence I think the only way would be for me to run the UPSERT for a range of sids (segment_ids) at a time to generate the daily data. Did I miss something?

On Tue, Nov 20, 2018 at 9:59 AM Thomas D'Silva <tdsi...@salesforce.com> wrote:

> Since your PK already leads with (sid, day) I don't think adding a
> secondary index will help. Do you generally always run the aggregation
> query for the recently inserted data? The row timestamp feature might help
> in this case: https://phoenix.apache.org/rowtimestamp.html
> If you run the same aggregate queries multiple times, then another approach
> is to store the coarser daily aggregated data in a separate table that you
> can populate using an UPSERT SELECT.
> I'm not sure why the explain plan you attached has a CLIENT MERGE SORT,
> since you don't have an ORDER BY.
>
> On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <mgand...@gmail.com> wrote:
>
>> Here it is:
>> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
>> [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
>> SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
>> CLIENT MERGE SORT
>>
>> Note: we have a dedicated Phoenix Query Server.
>>
>> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gjac...@salesforce.com> wrote:
>>
>>> Monil,
>>>
>>> Could you please post the results of an EXPLAIN plan of your query? For
>>> directions on how to do this, please see
>>> http://phoenix.apache.org/explainplan.html
>>>
>>> Geoffrey Jacoby
>>>
>>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mgand...@gmail.com> wrote:
>>>
>>>> Hello,
>>>> Currently we have hourly data in our Phoenix table. However, the schema
>>>> was designed to perform well for daily data. Increasing the number of
>>>> rows by 24x has led to degradation of our service over time.
>>>> Our current schema is as follows:
>>>>
>>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
>>>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid,
>>>> day, cid) ) COMPRESSION='SNAPPY'
>>>>
>>>> The query we run is something along the lines of:
>>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
>>>> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd',
>>>> 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid
>>>>
>>>> Based on our investigation we have concluded that the main reason is
>>>> purely the number of rows that are being read. I am open to other
>>>> suggestions.
>>>>
>>>> If the number of rows is indeed the problem, I am wondering if there is
>>>> a way to either
>>>> 1. roll hourly data up to daily data using views, a secondary index, or
>>>> MapReduce. I know MapReduce is possible.
>>>> 2. migrate to a newer schema where cid is not part of the PK and is
>>>> actually a column family. I was unable to find any kind of documentation
>>>> on this.
>>>>
>>>> Thanks,
>>>> Monil
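
For reference, a minimal sketch of the daily rollup approach discussed above. The DAILY_PUBLISHER_V4 schema never appears in this thread, so the DDL below is an assumption (column names follow the UPSERT at the top, which seem to correspond to sid/cid/count in the earlier DDL, with a PK laid out like the source table's). Two other points worth noting: Phoenix's UPSERT SELECT takes the SELECT directly rather than wrapped in VALUES, and a half-open day range keeps each statement to exactly one day of hourly rows.

-- Assumed DDL for the daily rollup table; adjust the PK order to match how
-- the daily data will actually be queried.
CREATE TABLE IF NOT EXISTS DAILY_PUBLISHER_V4 (
    segment_id BIGINT NOT NULL,
    day DATE NOT NULL,
    cross_segment_id BIGINT NOT NULL,
    reach BIGINT,
    CONSTRAINT PK PRIMARY KEY (segment_id, day, cross_segment_id)
) COMPRESSION='SNAPPY';

-- Populate one day per statement (UPSERT INTO ... SELECT, no VALUES wrapper).
UPSERT INTO DAILY_PUBLISHER_V4 (segment_id, cross_segment_id, day, reach)
SELECT segment_id, cross_segment_id,
       TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AS day,
       CAST(SUM(reach) AS BIGINT) AS reach
FROM PUBLISHER_V4
WHERE day >= TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
  AND day <  TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
GROUP BY segment_id, cross_segment_id;

With the rollup in place, the daily aggregation query reads roughly 1/24th of the rows it currently scans in the hourly table for the same date range.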
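The rejection error earlier in the thread looks like the client-side Phoenix job queue filling up (the pool size of 128 and queue of 5000 in the message line up with the defaults for phoenix.query.threadPoolSize and phoenix.query.queueSize, if I'm reading it right), presumably because without the leading sid the scan fans out into many parallel chunks. A hypothetical way to keep each statement small, mirroring the observation that a single sid works fine, is to chunk the backfill by segment_id range as well as by day; the 0 and 1000000 bounds below are placeholders for a driving script to loop over.

-- Hypothetical chunked backfill: one day and one segment_id range per statement.
UPSERT INTO DAILY_PUBLISHER_V4 (segment_id, cross_segment_id, day, reach)
SELECT segment_id, cross_segment_id,
       TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AS day,
       CAST(SUM(reach) AS BIGINT) AS reach
FROM PUBLISHER_V4
WHERE segment_id >= 0 AND segment_id < 1000000   -- placeholder range bounds
  AND day >= TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
  AND day <  TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
GROUP BY segment_id, cross_segment_id;

Each chunk then hits a bounded leading-key range of the source table, so no single statement should swamp the client thread pool.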