Perhaps I'm missing something, but I'm not sure how 24 additional
KeyValue columns are going to help performance. Sid would be a good one
to chime in, as he's done a lot of work with time series data for the
new metrics framework of Ambari (which uses Phoenix underneath).

Will you have queries that look up an AD_INFO row based on AD_ID? If
not, lead with AD_TS and salt your table. How about something like
this?

CREATE TABLE AD_INFO (
    AD_TS DATE NOT NULL,
    AD_ID BIGINT NOT NULL,
    CONSTRAINT PK PRIMARY KEY (AD_TS, AD_ID) )
    IMMUTABLE_ROWS=true, SALT_BUCKETS=64;

(This assumes that you won't be updating a row in-place).

You can run queries that bucketize by hour like this (over the last 7
days' worth of data):

    SELECT count(*) FROM AD_INFO
    WHERE AD_TS > CURRENT_DATE() - 7
    GROUP BY TRUNC(AD_TS, 'HOUR');
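
If you also need daily counts over the same window, the same pattern
works with a coarser truncation (a sketch against the table above;
adjust the window as needed):

    SELECT TRUNC(AD_TS, 'DAY') AS DAY_TS, count(*)
    FROM AD_INFO
    WHERE AD_TS > CURRENT_DATE() - 7
    GROUP BY TRUNC(AD_TS, 'DAY');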

If need be, you can add a functional index on AD_TS based on the hour
of the day, and Phoenix will maintain a separate table for you (and use
it automatically when you're filtering on HOUR(AD_TS)). This index will
have the same number of rows as the original table, but the rows for
the same hour of the day will be grouped together.

    CREATE INDEX hr_of_day_idx ON AD_INFO(HOUR(AD_TS));
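
For example, a query like the following should be able to use that
index (a sketch; Phoenix picks the index when the filter or grouping is
on HOUR(AD_TS)):

    SELECT HOUR(AD_TS) AS HR, count(*)
    FROM AD_INFO
    GROUP BY HOUR(AD_TS);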

If this doesn't perform well enough, then another alternative would be
for you to maintain a separate rollup table manually.
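
Such a rollup table might look something like this (a sketch with
made-up names; the UPSERT SELECT would be run periodically to fold the
hourly counts in):

    CREATE TABLE AD_INFO_HOURLY (
        HR_TS DATE NOT NULL,
        AD_ID BIGINT NOT NULL,
        IMP_COUNT BIGINT,
        CONSTRAINT PK PRIMARY KEY (HR_TS, AD_ID) )
        SALT_BUCKETS=64;

    UPSERT INTO AD_INFO_HOURLY
    SELECT TRUNC(AD_TS, 'HOUR'), AD_ID, count(*)
    FROM AD_INFO
    GROUP BY TRUNC(AD_TS, 'HOUR'), AD_ID;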

HTH. Thanks,

James



On Tue, Jun 23, 2015 at 8:38 AM, Puneet Kumar Ojha
<[email protected]> wrote:
> Yes, you are correct. Performance will be better with 24 columns. Better to
> add one more column (a 25th) to keep the daily aggregate across the 24
> hours. This will give a further performance boost.
>
>
>
> From: Pariksheet Barapatre [mailto:[email protected]]
> Sent: Tuesday, June 23, 2015 8:07 PM
> To: [email protected]
> Subject: Re: Data Model Suggestion
>
>
>
> Hi Puneet,
>
> Upsert could be easy -
>
> e.g., if we want to load data for hour ID 2015062412:
>
> ---------------------------------------------------------
> ADID | COUNTRY | HOUR ID    | CF.IMP |
> ---------------------------------------------------------
> 1    | US      | 2015062412 | 2000   |
> ---------------------------------------------------------
>
>
>
> Then the upsert will be:
>
> UPSERT INTO myTABLE (ADID, COUNTRY, DAYID, IMP12)
> VALUES (1, 'US', 20150624, 2000);
>
>
>
> Your point is true that HBase will store 24 KeyValues. One of the eBay blog
> posts suggests thinking of HBase storage as a sorted map:
>
>
> SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>
>
> Going by this, I guess keeping 24 columns does make a difference versus
> keeping the hour ID in the row key.
>
>
>
> Thanks
>
> Pari
>
>
>
> On 23 June 2015 at 19:10, Puneet Kumar Ojha <[email protected]>
> wrote:
>
> Even if you keep 24 columns, HBase will still store them as 24 records.
>
>
>
> There are benefits for certain use cases, like breaking the data into column
> families, and in how updates behave for a particular rowkey if you go with
> 24 columns.
>
>
>
>
>
> From: Pariksheet Barapatre [mailto:[email protected]]
> Sent: Tuesday, June 23, 2015 7:07 PM
> To: [email protected]
> Subject: Re: Data Model Suggestion
>
>
>
> Thanks, Skanda, for your suggestion. If we go with 24 columns, won't it
> reduce the number of rows over a large data set and improve scan
> performance? If I want a daily report, I only have to scan one row.
>
> Your thoughts?
>
> Thanks
>
> Pari
>
>
>
> On 23 June 2015 at 18:57, Skanda <[email protected]> wrote:
>
> Hi Pari
>
> For your use-case, having it as part of the rowkey should be a better design
> than creating so many columns.
>
> Regards
>
> Skanda
>
>
>
> On Tue, Jun 23, 2015 at 6:49 PM, Pariksheet Barapatre <[email protected]>
> wrote:
>
> Hello All,
>
>
>
> This is more of an HBase question, but as I am planning to use Phoenix as
> the access layer, I hope Phoenix users can help me.
>
> I would like to create time series data to get on-the-fly analytics.
>
> This use case is for adTech.
>
> Report - what are the hourly, daily, and weekly impression counts at country
> level for a given advertisement ID (ADID)?
>
> I am doing hourly aggregation and loading into a Phoenix table.
>
> Primary Key - ADID | COUNTRY | HOUR ID
>
> ---------------------------------------------------------
> ADID | COUNTRY | HOUR ID    | CF.IMP |
> ---------------------------------------------------------
> 1    | US      | 2015062301 | 3000   |
> 1    | US      | 2015062302 | 3421   |
> 1    | UK      | 2015062302 | 1212   |
> ---------------------------------------------------------
>
> Is it a good schema design, or shall I create an alternate schema as below?
>
> Primary Key - ADID | COUNTRY | DAY ID
>
> ------------------------------------------------------------
> ADID | COUNTRY | DAY ID   | CF.IMP01 | CF.IMP02 |
> ------------------------------------------------------------
> 1    | US      | 20150623 | 3000     | 3421     |
> 1    | UK      | 20150623 | NULL     | 1212     |
> ------------------------------------------------------------
>
> Here, I have taken the hour part from the hour ID and created 24 columns.
>
> I have gone through many time-series NoSQL blog posts; most authors suggest
> going with wider rows as above. This will reduce the scan, but I don't see
> much difference between the two data models in terms of scan latency.
>
> Can anybody please suggest a good approach for my use case?
>
>
>
> Cheers,
>
> Pari
>
>
>
>
>
>
> --
>
> Cheers,
>
> Pari
>
>
>
>
> --
>
> Cheers,
>
> Pari
