I would try writing the hourly values as 24 columns in a daily row, or as an array type.
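Something along these lines, as a rough sketch (the table name and the array column are made up for illustration; the key mirrors the existing hourly schema):

CREATE TABLE IF NOT EXISTS T1_DAILY (
    sid BIGINT NOT NULL,
    day DATE NOT NULL,
    cid BIGINT NOT NULL,
    -- one row per day; hourly values stored as a 24-element array
    -- (alternatively, 24 separate INTEGER columns h00 ... h23)
    s.hourly_counts INTEGER ARRAY[24],
    CONSTRAINT PK PRIMARY KEY (sid, day, cid)
) COMPRESSION='SNAPPY'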
I’m not up to speed on the latest Phoenix features, but if it could update a
daily sum on the fly that might be OK. If that doesn’t exist yet or isn’t
performant, it could be done in an HBase coprocessor. I would also put the
daily sum column in a separate column family in the same table, rather than in
a separate table.

- Jonathan

> On Nov 26, 2018, at 5:53 PM, Thomas D'Silva <tdsi...@salesforce.com> wrote:
>
> The 1-day aggregate query you are running should work; you might have to
> increase the client thread pool and queue size. See
> http://mail-archives.apache.org/mod_mbox/phoenix-user/201607.mbox/%3c577d338d.2080...@gmail.com%3E
>
>> On Tue, Nov 20, 2018 at 9:26 PM Monil Gandhi <mgand...@gmail.com> wrote:
>> Thanks for getting back.
>> The query we run is always an aggregation query. The date range is anywhere
>> from current to 2 years back.
>> I was looking into UPSERT SELECT, but there does not seem to be an easy way
>> to run it. For instance, I thought running something like the below would
>> be optimal:
>>
>> UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach)
>> VALUES (SELECT segment_id, cross_segment_id, TO_DATE('2017-07-01',
>> 'yyyy-MM-dd', 'GMT') as day, CAST(SUM(reach) AS BIGINT) as reach FROM
>> PUBLISHER_V4 WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
>> AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY segment_id,
>> cross_segment_id)
>>
>> Hence I tried running the SELECT statement on its own to see if it works,
>> and started to see the error below:
>> Error: Task
>> org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask@a66e580 rejected
>> from org.apache.phoenix.job.JobManager$1@20de05e5[Running, pool size = 128,
>> active threads = 128, queued tasks = 5000, completed tasks = 5153]
>> (state=08000,code=101)
>>
>> When I changed the SELECT query to include a particular sid, the UPSERT
>> SELECT worked.
>> Hence I think the only way would be for me to run the UPSERT for a range of
>> sids or segment_ids to generate the daily data.
>>
>> Did I miss something?
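A side note on the UPSERT quoted above: Phoenix's UPSERT ... SELECT takes the
SELECT directly rather than wrapping it in VALUES, so the rollup would
normally be written along these lines (a sketch against the same tables and
columns as the quoted query, with the day literal aligned to the queried
range):

UPSERT INTO DAILY_PUBLISHER_V4 (segment_id, cross_segment_id, day, reach)
SELECT segment_id, cross_segment_id,
       -- constant day bucket for the rolled-up rows
       TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AS day,
       CAST(SUM(reach) AS BIGINT) AS reach
FROM PUBLISHER_V4
WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
              AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
GROUP BY segment_id, cross_segment_id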
>>> On Tue, Nov 20, 2018 at 9:59 AM Thomas D'Silva <tdsi...@salesforce.com> wrote:
>>> Since your PK already leads with (sid, day), I don't think adding a
>>> secondary index will help. Do you generally always run the aggregation
>>> query for recently inserted data? The row timestamp feature might help in
>>> that case: https://phoenix.apache.org/rowtimestamp.html
>>> If you run the same aggregate queries multiple times, then another
>>> approach is to store the coarser daily aggregated data in a separate table
>>> that you can populate using an UPSERT SELECT.
>>> I'm not sure why the explain plan you attached has a CLIENT MERGE SORT,
>>> since you don't have an ORDER BY.
>>>
>>>> On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <mgand...@gmail.com> wrote:
>>>> Here it is:
>>>> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
>>>> [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
>>>> SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
>>>> CLIENT MERGE SORT
>>>>
>>>> Note: we have a dedicated Phoenix Query Server.
>>>>
>>>>> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gjac...@salesforce.com> wrote:
>>>>> Monil,
>>>>>
>>>>> Could you please post the results of an EXPLAIN plan of your query? For
>>>>> directions on how to do this, please see
>>>>> http://phoenix.apache.org/explainplan.html
>>>>>
>>>>> Geoffrey Jacoby
>>>>>
>>>>>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mgand...@gmail.com> wrote:
>>>>>> Hello,
>>>>>> Currently we have hourly data in our Phoenix table. However, the schema
>>>>>> was designed to perform well for daily data. Increasing the number of
>>>>>> rows by 24x has led to degradation of our service over time.
>>>>>> Our current schema is as follows:
>>>>>>
>>>>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
>>>>>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid,
>>>>>> day, cid)) COMPRESSION='SNAPPY'
>>>>>>
>>>>>> The query we run is something along the lines of:
>>>>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
>>>>>> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd',
>>>>>> 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid
>>>>>>
>>>>>> Based on our investigation we have concluded that the main reason is
>>>>>> purely the number of rows being read. I am open to other suggestions.
>>>>>>
>>>>>> If the number of rows is the cause, I am wondering if there is a way to
>>>>>> either
>>>>>> 1. roll hourly data up to daily using views, a secondary index, or
>>>>>> MapReduce. I know MapReduce is possible.
>>>>>> 2. migrate to a newer schema where cid is not part of the PK and is
>>>>>> instead a column family. I was unable to find any documentation on this.
>>>>>>
>>>>>> Thanks
>>>>>> Monil
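For reference, the coarser daily table discussed above (option 1) might look
something like the following; the column names come from the UPSERT SELECT
sketched earlier, while the types, PK order, and column family are assumptions
based on the hourly schema:

CREATE TABLE IF NOT EXISTS DAILY_PUBLISHER_V4 (
    segment_id BIGINT NOT NULL,
    day DATE NOT NULL,
    cross_segment_id BIGINT NOT NULL,
    -- pre-aggregated daily reach, populated by the UPSERT SELECT above
    s.reach BIGINT,
    CONSTRAINT PK PRIMARY KEY (segment_id, day, cross_segment_id)
) COMPRESSION='SNAPPY'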