From a performance point of view, fewer columns are better when scanning. The
more columns you have in a filter, the more comparisons HBase must perform
to decide whether to skip or include a particular cell in the result set.
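
For example (a rough sketch against the 24-column table discussed below;
names as in the thread), a filter that references several of the hourly
KeyValue columns makes HBase examine each of those cells on every row:

    -- each IMPxx referenced in the WHERE clause is one more cell per row
    -- that HBase must locate and compare
    SELECT ADID FROM myTABLE
    WHERE DAYID = 20150623
    AND (IMP01 > 1000 OR IMP02 > 1000 OR IMP03 > 1000);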

-Vlad

On Tue, Jun 23, 2015 at 10:01 AM, James Taylor <[email protected]>
wrote:

> Perhaps I'm missing something, but I'm not sure how 24 additional
> KeyValue columns are going to help performance. Sid would be a good one
> to chime in, as he's done a lot of work with time series data for the
> new metrics framework of Ambari (which uses Phoenix underneath).
>
> Will you have queries that look up an AD_INFO row based on AD_ID? If
> not, lead with AD_TS and salt your table. How about something like
> this?
>
> CREATE TABLE AD_INFO (
>     AD_TS DATE NOT NULL,
>     AD_ID BIGINT NOT NULL,
>     CONSTRAINT PK PRIMARY KEY (AD_TS, AD_ID) )
>     IMMUTABLE_ROWS=true, SALT_BUCKETS=64;
>
> (This assumes that you won't be updating a row in-place).
>
> You can run queries that bucketize by hour like this (over the last 7
> days' worth of data):
>
>     SELECT count(*) FROM AD_INFO
>     WHERE AD_TS > CURRENT_DATE() - 7
>     GROUP BY TRUNC(AD_TS, 'HOUR');
>
> If need be, you can add a functional index on AD_TS based on the hour
> of the day, and Phoenix will maintain a separate table for you (and use
> it automatically when you're filtering on HOUR(AD_TS)). This index will
> have the same number of rows as the original, but the rows for the same
> hour of the day will be grouped together.
>
>     CREATE INDEX hr_of_day_idx ON AD_INFO(HOUR(AD_TS));
>
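> For example, a query that filters on the hour of day should then be
> able to use it (a sketch; adjust to your real query):
>
>     SELECT count(*) FROM AD_INFO
>     WHERE HOUR(AD_TS) = 12;
>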
> If this doesn't perform well enough, then another alternative would be
> for you to maintain a separate rollup table manually.
>
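> A minimal sketch of such a rollup table (names and hourly granularity
> are just for illustration):
>
>     CREATE TABLE AD_INFO_HOURLY (
>         HOUR_TS DATE NOT NULL,
>         AD_ID BIGINT NOT NULL,
>         IMP_COUNT BIGINT,
>         CONSTRAINT PK PRIMARY KEY (HOUR_TS, AD_ID) )
>         SALT_BUCKETS=64;
>
>     -- run periodically to roll up the most recent raw rows
>     UPSERT INTO AD_INFO_HOURLY
>         SELECT TRUNC(AD_TS, 'HOUR'), AD_ID, count(*)
>         FROM AD_INFO
>         WHERE AD_TS > CURRENT_DATE() - 1
>         GROUP BY TRUNC(AD_TS, 'HOUR'), AD_ID;
>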
> HTH. Thanks,
>
> James
>
>
>
> On Tue, Jun 23, 2015 at 8:38 AM, Puneet Kumar Ojha
> <[email protected]> wrote:
> > Yes, you are correct. Performance will be better with 24 columns. Better
> > to add one more column (a 25th) to keep the daily aggregate across the
> > 24 hours. This will give a further performance boost.
> >
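> > For example (IMP_DAY as the hypothetical 25th column, with the daily
> > total precomputed by your aggregation job):
> >
> > UPSERT INTO myTABLE (ADID, COUNTRY, DAYID, IMP_DAY)
> > VALUES (1, 'US', 20150624, 45000);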
> >
> >
> > From: Pariksheet Barapatre [mailto:[email protected]]
> > Sent: Tuesday, June 23, 2015 8:07 PM
> > To: [email protected]
> > Subject: Re: Data Model Suggestion
> >
> >
> >
> > Hi Puneet,
> >
> > Upsert could be easy -
> >
> > e.g. if we want to load data for hour 2015062412:
> >
> > ----------------------------------------------
> > ADID | COUNTRY | HOUR ID    | CF.IMP |
> > ----------------------------------------------
> > 1    | US      | 2015062412 | 2000   |
> >
> >
> >
> > Then the upsert will be:
> >
> > UPSERT INTO myTABLE (ADID, COUNTRY, DAYID, IMP12)
> > VALUES (1, 'US', 20150624, 2000);
> >
> >
> >
> > You're right that HBase will store 24 key-values. One of the eBay blogs
> > says to think of the key-value store as a sorted map:
> >
> >
> > SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>
> >
> > Going by this, I guess keeping 24 columns does make a difference versus
> > putting the hour id in the row key.
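> >
> > Roughly, the two layouts in that map would look like (illustrative only):
> >
> > 24 columns:   "1|US|20150623"   -> { IMP01=3000, IMP02=3421, ... }
> > hour in key:  "1|US|2015062301" -> { IMP=3000 }
> >               "1|US|2015062302" -> { IMP=3421 }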
> >
> >
> >
> > Thanks
> >
> > Pari
> >
> >
> >
> > On 23 June 2015 at 19:10, Puneet Kumar Ojha <[email protected]>
> > wrote:
> >
> > Even if you keep 24 columns, HBase will still store them as 24 records.
> >
> >
> >
> > There are benefits depending on the use case, such as splitting into
> > column families, and how updates behave for a particular rowkey if you
> > go with 24 columns.
> >
> >
> >
> >
> >
> > From: Pariksheet Barapatre [mailto:[email protected]]
> > Sent: Tuesday, June 23, 2015 7:07 PM
> > To: [email protected]
> > Subject: Re: Data Model Suggestion
> >
> >
> >
> > Thanks Skanda for your suggestion. If we go with 24 columns, won't it
> > reduce the number of rows over a large data set and improve scan
> > performance? If I want a daily report, I just have to scan one row.
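> >
> > For example (a sketch; myHOURLYTABLE stands in for the hour-in-key
> > table), the daily report becomes:
> >
> > -- 24-column (daily) table: a single-row read
> > -- (NULL hour columns would need COALESCE in a real query)
> > SELECT IMP01 + IMP02 /* ... through IMP24 */ FROM myTABLE
> > WHERE ADID = 1 AND COUNTRY = 'US' AND DAYID = 20150623;
> >
> > -- hour-in-key table: aggregate 24 rows
> > SELECT SUM(IMP) FROM myHOURLYTABLE
> > WHERE ADID = 1 AND COUNTRY = 'US'
> > AND HOURID BETWEEN 2015062300 AND 2015062323;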
> >
> > Your thoughts?
> >
> > Thanks
> >
> > Pari
> >
> >
> >
> > On 23 June 2015 at 18:57, Skanda <[email protected]> wrote:
> >
> > Hi Pari
> >
> > For your use case, having the hour as part of the rowkey should be a
> > better design than creating so many columns.
> >
> > Regards
> >
> > Skanda
> >
> >
> >
> > On Tue, Jun 23, 2015 at 6:49 PM, Pariksheet Barapatre
> > <[email protected]> wrote:
> >
> > Hello All,
> >
> >
> >
> > This is more of an HBase question, but as I am planning to use Phoenix
> > as an access layer, I hope Phoenix users can help me.
> >
> > I would like to create time series data to get on-the-fly analytics.
> >
> > This use case is for adTech.
> >
> > Report: hourly, daily, and weekly impression counts at the country
> > level for a given advertisement ID (ADID).
> >
> > I am doing hourly aggregation and loading into a Phoenix table.
> >
> > Primary Key - ADID | COUNTRY | HOUR ID
> >
> > ----------------------------------------------
> > ADID | COUNTRY | HOUR ID    | CF.IMP |
> > ----------------------------------------------
> > 1    | US      | 2015062301 | 3000   |
> > 1    | US      | 2015062302 | 3421   |
> > 1    | UK      | 2015062302 | 1212   |
> > ----------------------------------------------
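> >
> > In Phoenix DDL, this first option would be something like (types are
> > assumed):
> >
> > CREATE TABLE AD_IMPRESSIONS (
> >     ADID BIGINT NOT NULL,
> >     COUNTRY VARCHAR NOT NULL,
> >     HOURID BIGINT NOT NULL,
> >     CF.IMP BIGINT,
> >     CONSTRAINT PK PRIMARY KEY (ADID, COUNTRY, HOURID) );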
> >
> > Is it a good schema design, or shall I create the alternate schema
> > below?
> >
> > Primary Key - ADID | COUNTRY | DAY ID
> >
> > ---------------------------------------------------
> > ADID | COUNTRY | DAY ID   | CF.IMP01 | CF.IMP02 |
> > ---------------------------------------------------
> > 1    | US      | 20150623 | 3000     | 3421     |
> > 1    | UK      | 20150623 | NULL     | 1212     |
> > ---------------------------------------------------
> >
> > Here, I have taken the hour part from the hour ID and created 24
> > columns (only the first two are shown above).
> >
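> > In DDL that would be roughly (again, types assumed, middle columns
> > elided):
> >
> > CREATE TABLE AD_IMPRESSIONS_DAILY (
> >     ADID BIGINT NOT NULL,
> >     COUNTRY VARCHAR NOT NULL,
> >     DAYID BIGINT NOT NULL,
> >     CF.IMP01 BIGINT,
> >     CF.IMP02 BIGINT,
> >     -- ... CF.IMP03 through CF.IMP23 ...
> >     CF.IMP24 BIGINT,
> >     CONSTRAINT PK PRIMARY KEY (ADID, COUNTRY, DAYID) );
> >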
> > I have gone through many time-series NoSQL blog posts, and most authors
> > suggest going with wider rows, as above. This will reduce the scan, but
> > I don't see much difference between the two data models in terms of
> > scan latency.
> >
> > Can anybody please suggest a good approach for my use case?
> >
> >
> >
> > Cheers,
> >
> > Pari
> >
> >
> >
> >
> >
> >
> > --
> >
> > Cheers,
> >
> > Pari
> >
> >
> >
> >
> > --
> >
> > Cheers,
> >
> > Pari
>
