The 1-day aggregate query you are running should work; you may just need to increase the client thread pool and queue size. See http://mail-archives.apache.org/mod_mbox/phoenix-user/201607.mbox/%3c577d338d.2080...@gmail.com%3E
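The pool size (128) and queued tasks (5000) in the error you posted are the Phoenix client defaults, which is why the full-table UPSERT SELECT hits the rejection. A sketch of the relevant client-side hbase-site.xml settings follows; the values are illustrative, not recommendations, so tune them to your workload. (A corrected sketch of the rollup UPSERT SELECT itself follows the quoted thread below.)

  <!-- Client-side hbase-site.xml (sketch; example values only).
       Both properties are read by the Phoenix client JVM, so with a
       query server they must be set on the PQS host and need a restart. -->
  <property>
    <name>phoenix.query.threadPoolSize</name>
    <!-- Threads used to run scans in parallel; default is 128. -->
    <value>256</value>
  </property>
  <property>
    <name>phoenix.query.queueSize</name>
    <!-- Max tasks queued before new work is rejected; default is 5000. -->
    <value>10000</value>
  </property>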
On Tue, Nov 20, 2018 at 9:26 PM Monil Gandhi <mgand...@gmail.com> wrote:

> Thanks for getting back.
> The query we run is always an aggregation query. The date range is
> anywhere from current to 2 years back.
> I was looking into UPSERT SELECT, but there does not seem to be an easy
> way to run it. For instance, I thought running something like the below
> would be optimal:
>
> UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach)
> VALUES (SELECT segment_id, cross_segment_id, TO_DATE('2017-07-01',
> 'yyyy-MM-dd', 'GMT') as day, CAST(SUM(reach) AS BIGINT) as reach FROM
> PUBLISHER_V4 WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
> AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY segment_id,
> cross_segment_id)
>
> Hence I tried running the SELECT statement on its own to see if it works,
> and started to see the error below:
>
> Error: Task
> org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask@a66e580
> rejected from org.apache.phoenix.job.JobManager$1@20de05e5[Running, pool
> size = 128, active threads = 128, queued tasks = 5000, completed tasks =
> 5153] (state=08000,code=101)
>
> When I changed the SELECT query to include a particular sid, the UPSERT
> SELECT worked.
> Hence I think the only way would be for me to run the UPSERT over ranges
> of sids (segment_ids) to generate the daily data.
>
> Did I miss something?
>
> On Tue, Nov 20, 2018 at 9:59 AM Thomas D'Silva <tdsi...@salesforce.com>
> wrote:
>
>> Since your PK already leads with (sid, day), I don't think adding a
>> secondary index will help. Do you generally run the aggregation query
>> over recently inserted data? The row timestamp feature might help in
>> this case:
>> https://phoenix.apache.org/rowtimestamp.html
>> If you run the same aggregate queries multiple times, then another
>> approach is to store the coarser daily aggregated data in a separate
>> table that you can populate using an UPSERT SELECT.
>> I'm not sure why the explain plan you attached has a CLIENT MERGE SORT,
>> since you don't have an ORDER BY.
>>
>> On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <mgand...@gmail.com> wrote:
>>
>>> Here it is:
>>>
>>> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
>>> [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
>>> SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
>>> CLIENT MERGE SORT
>>>
>>> Note: we have a dedicated Phoenix Query Server.
>>>
>>> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gjac...@salesforce.com>
>>> wrote:
>>>
>>>> Monil,
>>>>
>>>> Could you please post the results of an EXPLAIN plan for your query?
>>>> For directions on how to do this, please see
>>>> http://phoenix.apache.org/explainplan.html
>>>>
>>>> Geoffrey Jacoby
>>>>
>>>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mgand...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>> Currently we have hourly data in our Phoenix table. However, the
>>>>> schema was designed to perform well for daily data. Increasing the
>>>>> number of rows by 24x has led to degradation of our service over time.
>>>>> Our current schema is as follows:
>>>>>
>>>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
>>>>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid,
>>>>> day, cid)) COMPRESSION='SNAPPY'
>>>>>
>>>>> The query we run is something along the lines of:
>>>>>
>>>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
>>>>> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01',
>>>>> 'yyyy-MM-dd', 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
>>>>> GROUP BY sid, cid
>>>>>
>>>>> Based on our investigation, we have concluded that the main reason is
>>>>> purely the number of rows being read. I am open to other suggestions.
>>>>>
>>>>> If the number of rows is the cause, I am wondering if there is a way
>>>>> either:
>>>>> 1. to roll hourly data up to daily using views, a secondary index, or
>>>>> MapReduce (I know MapReduce is possible), or
>>>>> 2. to migrate to a newer schema where cid is not part of the PK and is
>>>>> instead a column family. I was unable to find any documentation on
>>>>> this.
>>>>>
>>>>> Thanks,
>>>>> Monil
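For reference, here is a sketch of the daily-rollup UPSERT SELECT discussed in the thread, using the table and column names quoted above. Phoenix's UPSERT SELECT takes the SELECT directly rather than wrapping it in VALUES (...), and the bind parameters on segment_id are one way to chunk the work by range of sids, which is what worked above. The 2018-07-01 day value is an assumption: the quoted attempt wrote a 2017 date while selecting 2018 rows, which looks like a typo.

  -- Sketch: roll one day of hourly rows in PUBLISHER_V4 up into
  -- DAILY_PUBLISHER_V4. The SELECT follows UPSERT INTO directly;
  -- there is no VALUES(...) wrapper around it.
  UPSERT INTO DAILY_PUBLISHER_V4 (segment_id, cross_segment_id, day, reach)
  SELECT segment_id,
         cross_segment_id,
         TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AS day,  -- day bucket
         CAST(SUM(reach) AS BIGINT) AS reach
  FROM PUBLISHER_V4
  WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
                AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
    AND segment_id BETWEEN ? AND ?  -- bind per chunk of sids
  GROUP BY segment_id, cross_segment_id

Running one such statement per day (and per segment_id chunk) keeps each statement's parallel scan work within the client thread pool and queue limits, matching what was observed when the query was restricted to a particular sid.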