I would try writing the hourly values as 24 columns in a daily row, or as an 
array type.
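For example, a rough sketch of the array variant (untested; the table and 
column names here are placeholders, not your actual schema):

CREATE TABLE IF NOT EXISTS PUBLISHER_DAILY (
    sid BIGINT NOT NULL,
    day DATE NOT NULL,
    cid BIGINT NOT NULL,
    -- 24 hourly values stored in a single cell per daily row
    s.hourly_counts INTEGER ARRAY[24],
    CONSTRAINT PK PRIMARY KEY (sid, day, cid)
) COMPRESSION='SNAPPY'

The same idea works with 24 separate INTEGER columns (h00 ... h23) if arrays 
are awkward for your writers.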

I’m not up to speed on the latest Phoenix features, but if it could update a 
daily sum on the fly, that might be OK. If that doesn’t exist yet or isn’t 
performant, it could be done in an HBase coprocessor.

I would also put the daily sum column in a separate column family in the same 
table, rather than a separate table.
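Something like (again just a sketch; 'd' being the separate family):

ALTER TABLE PUBLISHER_DAILY ADD d.daily_count BIGINT

Queries that only need the daily number then only touch the 'd' column family 
and skip the hourly cells entirely.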

- Jonathan

> On Nov 26, 2018, at 5:53 PM, Thomas D'Silva <tdsi...@salesforce.com> wrote:
> 
> The 1-day aggregate query you are running should work; you might have to 
> increase the client thread pool and queue size.
> See 
> http://mail-archives.apache.org/mod_mbox/phoenix-user/201607.mbox/%3c577d338d.2080...@gmail.com%3E
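> For example, in the client-side hbase-site.xml (the values below are only 
> illustrative; the defaults are 128 and 5000, which match the numbers in your 
> error):
> 
> <property>
>   <name>phoenix.query.threadPoolSize</name>
>   <value>256</value>
> </property>
> <property>
>   <name>phoenix.query.queueSize</name>
>   <value>10000</value>
> </property>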
> 
>> On Tue, Nov 20, 2018 at 9:26 PM Monil Gandhi <mgand...@gmail.com> wrote:
>> Thanks for getting back.
>> The query we run is always an aggregation query. The date range can be 
>> anywhere from the current day to 2 years back.
>> I was looking into UPSERT SELECT, but there does not seem to be an easy way 
>> to run it. For instance, I thought running something like the statement 
>> below would be optimal:
>> 
>> UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach) 
>> SELECT segment_id, cross_segment_id, TO_DATE('2018-07-01', 'yyyy-MM-dd', 
>> 'GMT') AS day, CAST(SUM(reach) AS BIGINT) AS reach FROM PUBLISHER_V4 
>> WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AND 
>> TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY segment_id, 
>> cross_segment_id
>> 
>> Hence I tried running just the SELECT statement to see if it works, and 
>> started to see the error below:
>> Error: Task 
>> org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask@a66e580 rejected 
>> from org.apache.phoenix.job.JobManager$1@20de05e5[Running, pool size = 128, 
>> active threads = 128, queued tasks = 5000, completed tasks = 5153] 
>> (state=08000,code=101)
>> 
>> When I changed the SELECT query to include a particular sid, the UPSERT 
>> SELECT worked.
>> Hence I think the only way would be for me to run the UPSERT SELECT once 
>> per range of sids (segment_ids) to generate the daily data.
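>> In other words, something along these lines per batch (the segment_id 
>> boundaries below are made up):
>> 
>> UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach) 
>> SELECT segment_id, cross_segment_id, TO_DATE('2018-07-01', 'yyyy-MM-dd', 
>> 'GMT') AS day, CAST(SUM(reach) AS BIGINT) AS reach FROM PUBLISHER_V4 
>> WHERE segment_id >= 34000 AND segment_id < 35000 
>> AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AND 
>> TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY segment_id, 
>> cross_segment_id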
>> 
>> Did I miss something?
>> 
>>> On Tue, Nov 20, 2018 at 9:59 AM Thomas D'Silva <tdsi...@salesforce.com> 
>>> wrote:
>>> Since your PK already leads with (sid, day) I don't think adding a 
>>> secondary index will help. Do you generally always run the aggregation 
>>> query for the recently inserted data? The row timestamp feature might help 
>>> in this case 
>>> https://phoenix.apache.org/rowtimestamp.html
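>>> If I remember the syntax right, the day column would be flagged at table 
>>> creation time (it cannot be added to an existing table), roughly:
>>> 
>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL, 
>>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid, 
>>> day ROW_TIMESTAMP, cid)) COMPRESSION='SNAPPY'
>>> 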
>>> If you run the same aggregate queries multiple times then another approach 
>>> is to store the coarser daily aggregated data in a separate table that you 
>>> can populate using an UPSERT SELECT.
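>>> As a rough, untested sketch (the daily table name is made up, column names 
>>> taken from your query):
>>> 
>>> CREATE TABLE IF NOT EXISTS PUBLISHER_DAILY_V4 (sid BIGINT NOT NULL, 
>>> day DATE NOT NULL, cid BIGINT NOT NULL, s.count BIGINT, 
>>> CONSTRAINT PK PRIMARY KEY (sid, day, cid)) COMPRESSION='SNAPPY'
>>> 
>>> UPSERT INTO PUBLISHER_DAILY_V4(sid, day, cid, count) 
>>> SELECT sid, TRUNC(day, 'DAY'), cid, CAST(SUM(count) AS BIGINT) 
>>> FROM PUBLISHER_V4 
>>> WHERE day >= TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') 
>>> AND day < TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') 
>>> GROUP BY sid, TRUNC(day, 'DAY'), cid
>>> 
>>> Your 1-day queries would then hit the daily table instead of scanning 24 
>>> hourly rows per key.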
>>> I'm not sure why the explain plan you attached has a CLIENT MERGE SORT, 
>>> since you don't have an order by.
>>> 
>>>> On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <mgand...@gmail.com> wrote:
>>>> Here it is 
>>>> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4 
>>>> [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420 
>>>> SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
>>>> CLIENT MERGE SORT
>>>> 
>>>> Note: we have a dedicated phoenix query server
>>>> 
>>>>> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gjac...@salesforce.com> 
>>>>> wrote:
>>>>> Monil,
>>>>> 
>>>>> Could you please post the results of an EXPLAIN plan of your query? For 
>>>>> directions how to do this please see 
>>>>> http://phoenix.apache.org/explainplan.html
>>>>> 
>>>>> Geoffrey Jacoby
>>>>> 
>>>>>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mgand...@gmail.com> wrote:
>>>>>> Hello,
>>>>>> Currently we have hourly data in our phoenix table. However, the schema 
>>>>>> was designed to perform well for daily data. Increasing the number of 
>>>>>> rows by 24X has led to degradation of our service over time. 
>>>>>> Our current schema is as follows 
>>>>>> 
>>>>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL, 
>>>>>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid, 
>>>>>> day, cid) ) COMPRESSION='SNAPPY'
>>>>>> 
>>>>>> The query we run is something along the lines of 
>>>>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4 
>>>>>> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 
>>>>>> 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid
>>>>>> 
>>>>>> Based on our investigation, we have concluded that the main reason is 
>>>>>> purely the number of rows being read, but I am open to other 
>>>>>> suggestions.
>>>>>> 
>>>>>> If the number of rows is indeed the issue, I am wondering if there is a 
>>>>>> way to either
>>>>>> 1. roll hourly data up to daily using views, a secondary index, or map 
>>>>>> reduce (I know map reduce is possible), or
>>>>>> 2. migrate to a newer schema where cid is not part of the PK and is 
>>>>>> instead a column family. I was unable to find any kind of documentation 
>>>>>> on this.
>>>>>> 
>>>>>> Thanks
>>>>>> Monil
