The 1-day aggregate query you are running should work; you may just need to increase the client thread pool and queue size. See http://mail-archives.apache.org/mod_mbox/phoenix-user/201607.mbox/%3c577d338d.2080...@gmail.com%3E
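The pool size (128) and queued tasks (5000) in the error you posted are the Phoenix client defaults, which is why the full-table UPSERT SELECT hits the rejection. A sketch of the relevant client-side hbase-site.xml settings follows; the values are illustrative, not recommendations, so tune them to your workload. (A corrected sketch of the rollup UPSERT SELECT itself follows the quoted thread below.)

  <!-- Client-side hbase-site.xml (sketch; example values only).
       Both properties are read by the Phoenix client JVM, so with a
       query server they must be set on the PQS host and need a restart. -->
  <property>
    <name>phoenix.query.threadPoolSize</name>
    <!-- Threads used to run scans in parallel; default is 128. -->
    <value>256</value>
  </property>
  <property>
    <name>phoenix.query.queueSize</name>
    <!-- Max tasks queued before new work is rejected; default is 5000. -->
    <value>10000</value>
  </property>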
On Tue, Nov 20, 2018 at 9:26 PM Monil Gandhi <mgand...@gmail.com> wrote:

> Thanks for getting back.
> The query we run is always an aggregation query. The date range is
> anywhere from current to 2 years back.
> I was looking into UPSERT SELECT, but there does not seem to be an easy
> way to run it. For instance, I thought running something like the below
> would be optimal:
>
> UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach)
> VALUES (SELECT segment_id, cross_segment_id, TO_DATE('2017-07-01',
> 'yyyy-MM-dd', 'GMT') as day, CAST(SUM(reach) AS BIGINT) as reach FROM
> PUBLISHER_V4 WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
> AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY segment_id,
> cross_segment_id)
>
> Hence I tried running the SELECT statement on its own to see if it works,
> and started to see the error below:
>
> Error: Task
> org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask@a66e580
> rejected from org.apache.phoenix.job.JobManager$1@20de05e5[Running, pool
> size = 128, active threads = 128, queued tasks = 5000, completed tasks =
> 5153] (state=08000,code=101)
>
> When I changed the SELECT query to include a particular sid, the UPSERT
> SELECT worked.
> Hence I think the only way would be for me to run the UPSERT over ranges
> of sids (segment_ids) to generate the daily data.
>
> Did I miss something?
>
> On Tue, Nov 20, 2018 at 9:59 AM Thomas D'Silva <tdsi...@salesforce.com>
> wrote:
>
>> Since your PK already leads with (sid, day), I don't think adding a
>> secondary index will help. Do you generally run the aggregation query
>> over recently inserted data? The row timestamp feature might help in
>> this case:
>> https://phoenix.apache.org/rowtimestamp.html
>> If you run the same aggregate queries multiple times, then another
>> approach is to store the coarser daily aggregated data in a separate
>> table that you can populate using an UPSERT SELECT.
>> I'm not sure why the explain plan you attached has a CLIENT MERGE SORT,
>> since you don't have an ORDER BY.
>>
>> On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <mgand...@gmail.com> wrote:
>>
>>> Here it is:
>>>
>>> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
>>> [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
>>> SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
>>> CLIENT MERGE SORT
>>>
>>> Note: we have a dedicated Phoenix Query Server.
>>>
>>> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gjac...@salesforce.com>
>>> wrote:
>>>
>>>> Monil,
>>>>
>>>> Could you please post the results of an EXPLAIN plan for your query?
>>>> For directions on how to do this, please see
>>>> http://phoenix.apache.org/explainplan.html
>>>>
>>>> Geoffrey Jacoby
>>>>
>>>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mgand...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>> Currently we have hourly data in our Phoenix table. However, the
>>>>> schema was designed to perform well for daily data. Increasing the
>>>>> number of rows by 24x has led to degradation of our service over time.
>>>>> Our current schema is as follows:
>>>>>
>>>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
>>>>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid,
>>>>> day, cid)) COMPRESSION='SNAPPY'
>>>>>
>>>>> The query we run is something along the lines of:
>>>>>
>>>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
>>>>> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01',
>>>>> 'yyyy-MM-dd', 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
>>>>> GROUP BY sid, cid
>>>>>
>>>>> Based on our investigation, we have concluded that the main reason is
>>>>> purely the number of rows being read. I am open to other suggestions.
>>>>>
>>>>> If the number of rows is the cause, I am wondering if there is a way
>>>>> either:
>>>>> 1. to roll hourly data up to daily using views, a secondary index, or
>>>>> MapReduce (I know MapReduce is possible), or
>>>>> 2. to migrate to a newer schema where cid is not part of the PK and is
>>>>> instead a column family. I was unable to find any documentation on
>>>>> this.
>>>>>
>>>>> Thanks,
>>>>> Monil
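For reference, here is a sketch of the daily-rollup UPSERT SELECT discussed in the thread, using the table and column names quoted above. Phoenix's UPSERT SELECT takes the SELECT directly rather than wrapping it in VALUES (...), and the bind parameters on segment_id are one way to chunk the work by range of sids, which is what worked above. The 2018-07-01 day value is an assumption: the quoted attempt wrote a 2017 date while selecting 2018 rows, which looks like a typo.

  -- Sketch: roll one day of hourly rows in PUBLISHER_V4 up into
  -- DAILY_PUBLISHER_V4. The SELECT follows UPSERT INTO directly;
  -- there is no VALUES(...) wrapper around it.
  UPSERT INTO DAILY_PUBLISHER_V4 (segment_id, cross_segment_id, day, reach)
  SELECT segment_id,
         cross_segment_id,
         TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AS day,  -- day bucket
         CAST(SUM(reach) AS BIGINT) AS reach
  FROM PUBLISHER_V4
  WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
                AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
    AND segment_id BETWEEN ? AND ?  -- bind per chunk of sids
  GROUP BY segment_id, cross_segment_id

Running one such statement per day (and per segment_id chunk) keeps each statement's parallel scan work within the client thread pool and queue limits, matching what was observed when the query was restricted to a particular sid.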