Thanks for getting back. The query we run is always an aggregation query, and the date range can be anywhere from the current date to two years back. I was looking into UPSERT SELECT, but there does not seem to be an easy way to run it. For instance, I thought running something like the below would be optimal:
UPSERT INTO DAILY_PUBLISHER_V4(segment_id, cross_segment_id, day, reach)
VALUES (SELECT segment_id, cross_segment_id, TO_DATE('2017-07-01', 'yyyy-MM-dd', 'GMT') as day, CAST(SUM(reach) AS BIGINT) as reach
FROM PUBLISHER_V4
WHERE day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
GROUP BY segment_id, cross_segment_id)

Hence I tried running just the SELECT statement to see if it works, and I started to see the error below:

Error: Task org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask@a66e580 rejected from org.apache.phoenix.job.JobManager$1@20de05e5[Running, pool size = 128, active threads = 128, queued tasks = 5000, completed tasks = 5153] (state=08000,code=101)

When I changed the SELECT query to include a particular sid, the UPSERT SELECT worked. Hence I think the only way would be for me to run the UPSERT for a range of sids (segment_ids) at a time to generate the daily data. Did I miss something?

On Tue, Nov 20, 2018 at 9:59 AM Thomas D'Silva <tdsi...@salesforce.com> wrote:

> Since your PK already leads with (sid, day) I don't think adding a
> secondary index will help. Do you generally always run the aggregation
> query for the recently inserted data? The row timestamp feature might help
> in this case: https://phoenix.apache.org/rowtimestamp.html
> If you run the same aggregate queries multiple times, then another approach
> is to store the coarser daily aggregated data in a separate table that you
> can populate using an UPSERT SELECT.
> I'm not sure why the explain plan you attached has a CLIENT MERGE SORT,
> since you don't have an ORDER BY.
>
> On Fri, Nov 16, 2018 at 9:44 AM Monil Gandhi <mgand...@gmail.com> wrote:
>
>> Here it is:
>> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER PUBLISHER_V4
>> [-3790006267705642038,'2018-07-01 00:00:00.000'] - [-37900062677056420
>> SERVER AGGREGATE INTO DISTINCT ROWS BY [SEGMENT_ID, CROSS_SEGMENT_ID]
>> CLIENT MERGE SORT
>>
>> Note: we have a dedicated Phoenix Query Server.
>>
>> On Thu, Nov 15, 2018 at 5:23 PM Geoffrey Jacoby <gjac...@salesforce.com> wrote:
>>
>>> Monil,
>>>
>>> Could you please post the results of an EXPLAIN plan of your query? For
>>> directions on how to do this, please see
>>> http://phoenix.apache.org/explainplan.html
>>>
>>> Geoffrey Jacoby
>>>
>>> On Thu, Nov 15, 2018 at 5:02 PM Monil Gandhi <mgand...@gmail.com> wrote:
>>>
>>>> Hello,
>>>> Currently we have hourly data in our Phoenix table. However, the schema
>>>> was designed to perform well for daily data. Increasing the number of
>>>> rows by 24x has led to degradation of our service over time.
>>>> Our current schema is as follows:
>>>>
>>>> CREATE TABLE IF NOT EXISTS T1 (sid BIGINT NOT NULL, day DATE NOT NULL,
>>>> cid BIGINT NOT NULL, s.count INTEGER, CONSTRAINT PK PRIMARY KEY (sid,
>>>> day, cid) ) COMPRESSION='SNAPPY'
>>>>
>>>> The query we run is something along the lines of:
>>>> SELECT sid, cid, CAST(SUM(count) AS BIGINT) AS count FROM PUBLISHER_V4
>>>> WHERE sid IN (34535) AND day BETWEEN TO_DATE('2018-07-01', 'yyyy-MM-dd',
>>>> 'GMT') AND TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT') GROUP BY sid, cid
>>>>
>>>> Based on our investigation we have concluded that the main reason is
>>>> purely the number of rows that are being read. I am open to other
>>>> suggestions.
>>>>
>>>> If the number of rows is indeed the problem, I am wondering if there is
>>>> a way to either
>>>> 1. roll hourly data up to daily data using views, a secondary index, or
>>>> MapReduce. I know MapReduce is possible.
>>>> 2. migrate to a newer schema where cid is not part of the PK and is
>>>> actually a column family. I was unable to find any kind of documentation
>>>> on this.
>>>>
>>>> Thanks,
>>>> Monil
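
For reference, a minimal sketch of the daily rollup approach discussed above. The DAILY_PUBLISHER_V4 schema never appears in this thread, so the DDL below is an assumption (column names follow the UPSERT at the top, which seem to correspond to sid/cid/count in the earlier DDL, with a PK laid out like the source table's). Two other points worth noting: Phoenix's UPSERT SELECT takes the SELECT directly rather than wrapped in VALUES, and a half-open day range keeps each statement to exactly one day of hourly rows.

-- Assumed DDL for the daily rollup table; adjust the PK order to match how
-- the daily data will actually be queried.
CREATE TABLE IF NOT EXISTS DAILY_PUBLISHER_V4 (
    segment_id BIGINT NOT NULL,
    day DATE NOT NULL,
    cross_segment_id BIGINT NOT NULL,
    reach BIGINT,
    CONSTRAINT PK PRIMARY KEY (segment_id, day, cross_segment_id)
) COMPRESSION='SNAPPY';

-- Populate one day per statement (UPSERT INTO ... SELECT, no VALUES wrapper).
UPSERT INTO DAILY_PUBLISHER_V4 (segment_id, cross_segment_id, day, reach)
SELECT segment_id, cross_segment_id,
       TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AS day,
       CAST(SUM(reach) AS BIGINT) AS reach
FROM PUBLISHER_V4
WHERE day >= TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
  AND day <  TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
GROUP BY segment_id, cross_segment_id;

With the rollup in place, the daily aggregation query reads roughly 1/24th of the rows it currently scans in the hourly table for the same date range.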
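The rejection error earlier in the thread looks like the client-side Phoenix job queue filling up (the pool size of 128 and queue of 5000 in the message line up with the defaults for phoenix.query.threadPoolSize and phoenix.query.queueSize, if I'm reading it right), presumably because without the leading sid the scan fans out into many parallel chunks. A hypothetical way to keep each statement small, mirroring the observation that a single sid works fine, is to chunk the backfill by segment_id range as well as by day; the 0 and 1000000 bounds below are placeholders for a driving script to loop over.

-- Hypothetical chunked backfill: one day and one segment_id range per statement.
UPSERT INTO DAILY_PUBLISHER_V4 (segment_id, cross_segment_id, day, reach)
SELECT segment_id, cross_segment_id,
       TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT') AS day,
       CAST(SUM(reach) AS BIGINT) AS reach
FROM PUBLISHER_V4
WHERE segment_id >= 0 AND segment_id < 1000000   -- placeholder range bounds
  AND day >= TO_DATE('2018-07-01', 'yyyy-MM-dd', 'GMT')
  AND day <  TO_DATE('2018-07-02', 'yyyy-MM-dd', 'GMT')
GROUP BY segment_id, cross_segment_id;

Each chunk then hits a bounded leading-key range of the source table, so no single statement should swamp the client thread pool.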