Since your PK already leads with (sid, day), I don't think adding a secondary index will help.

Do you generally run the aggregation query only over recently inserted data? If so, the ROW_TIMESTAMP feature might help here: https://phoenix.apache.org/rowtimestamp.html

If you run the same aggregate queries multiple times, another approach is to store the coarser, daily-aggregated data in a separate table that you populate with an UPSERT SELECT. I've put rough sketches of both ideas below.

I'm also not sure why the explain plan you attached has a CLIENT MERGE SORT, since you don't have an ORDER BY.
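For the ROW_TIMESTAMP idea, a minimal sketch of what the declaration might look like adapted to your schema. The PUBLISHER_V5 name is just illustrative, the designation can only be made at table creation (so it would mean rewriting the data), and the doc page above lists the exact restrictions on which PK columns qualify, so please check it before relying on this shape:

-- Hypothetical rebuild of the table with the day PK column mapped to the
-- HBase cell timestamp, so scans over recent dates can skip older store files.
CREATE TABLE IF NOT EXISTS PUBLISHER_V5 (
    sid BIGINT NOT NULL,
    day DATE NOT NULL,
    cid BIGINT NOT NULL,
    s.count INTEGER,
    CONSTRAINT PK PRIMARY KEY (sid, day ROW_TIMESTAMP, cid)
) COMPRESSION='SNAPPY'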
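For the rollup idea, something along these lines (the T1_DAILY name and the date range are just placeholders for whatever window you re-aggregate, and I'm using TRUNC to bucket the hourly timestamps to days):

-- Hypothetical daily rollup table with the same PK shape as the hourly table
CREATE TABLE IF NOT EXISTS T1_DAILY (
    sid BIGINT NOT NULL,
    day DATE NOT NULL,
    cid BIGINT NOT NULL,
    s.count BIGINT,
    CONSTRAINT PK PRIMARY KEY (sid, day, cid)
) COMPRESSION='SNAPPY'

-- Periodically roll one day of hourly rows up into it
UPSERT INTO T1_DAILY (sid, day, cid, s.count)
SELECT sid, TRUNC(day, 'DAY'), cid, CAST(SUM(count) AS BIGINT)
FROM T1
WHERE day >= TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
  AND day < TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
GROUP BY sid, TRUNC(day, 'DAY'), cid

Your existing aggregation query would then run against the rollup table and touch roughly 1/24th of the rows.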
On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <mgand...@gmail.com> wrote:

> Here it is
> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
> [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
> SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
> CLIENT MERGE SORT
>
> Note: we have a dedicated Phoenix Query Server
>
> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gjac...@salesforce.com>
> wrote:
>
>> Monil,
>>
>> Could you please post the results of an EXPLAIN plan of your query? For
>> directions on how to do this, please see
>> http://phoenix.apache.org/explainplan.html
>>
>> Geoffrey Jacoby
>>
>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mgand...@gmail.com> wrote:
>>
>>> Hello,
>>> Currently we have hourly data in our Phoenix table. However, the schema
>>> was designed to perform well for daily data. Increasing the number of rows
>>> by 24x has led to degradation of our service over time.
>>> Our current schema is as follows:
>>>
>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
>>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid, day,
>>> cid) ) COMPRESSION='SNAPPY'
>>>
>>> The query we run is something along the lines of
>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
>>> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd',
>>> 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid
>>>
>>> Based on our investigation we have concluded that the main reason is
>>> purely the number of rows being read. I am open to other suggestions.
>>>
>>> If the number of rows is the issue, I am wondering if there is a way to either
>>> 1. roll hourly data up to daily using views, a secondary index, or map
>>> reduce. I know map reduce is possible.
>>> 2. migrate to a newer schema where cid is not part of the PK and is actually
>>> a column family. I was unable to find any kind of documentation on this.
>>>
>>> Thanks
>>> Monil